Dario Faggioli
2013-Nov-13 19:10 UTC
[PATCH v2 00/16] Implement vcpu soft affinity for credit1
Hi,

This is take 2 of the per-vcpu NUMA affinity series... Which is not about
per-vcpu NUMA affinity any more! :-)

As agreed with Jan and George, I re-architected the thing and made it more
general. So, now it's about allowing each vcpu to have:
 - a hard affinity, which they already do, and we usually call pinning. This
   is the list of pcpus where a vcpu is allowed to run;
 - a soft affinity, which this series introduces. This is the list of pcpus
   where a vcpu *prefers* to run.

Once that is done, per-vcpu NUMA-aware scheduling is easily implemented on top
of that, just by instructing libxl to issue the proper call to set up the soft
affinity of the domain's vcpus to be equal to its node-affinity (see, for
instance, patch 16).

I'm sorry it took way longer than I expected, but it turned out that just
renaming variables and moving hunks around the series is an activity very
prone to stupid mistakes! :-P Also, I had to update all the changes to the
in-tree docs and manpages, and then double check that I had done so correctly,
to make sure they actually reflect the new architecture.

As before, the first part of the series (patches 1-4) carries the updated
syntax for specifying cpu (both hard and soft) affinity, which has been
discussed and (mostly) acked already, either in v1 or in a previous submission
(under a different series name).

The whole series should be pretty straightforward, as it's mostly renaming
variables and moving or rewiring existing stuff. Perhaps (and I'm staring at
you, tools maintainers :-D) have a more thorough look at patches 11 and 12
(especially 12), where the interface for this new feature is crafted, at the
libxc (patch 11) and libxl (patch 12) level.

Release wise, I think this is a nice feature to have, especially now that
we've made it more general, and it can find users and use cases outside of the
NUMA domain, where it was confined before.
Also, I don't think it's that much of a complex series (as I said, it's mostly
renaming/reshuffling), so perhaps it can go in. At the same time, the natural
consumer for this feature is vNUMA. Thus, since vNUMA is probably going to be
a 4.5 thing, having this series slip wouldn't be too bad.

The series is also available here:

  git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v2

Thanks and Regards,
Dario

---

Dario Faggioli (16):
      xl: match output of vcpu-list with pinning syntax
      xl: allow for node-wise specification of vcpu pinning
 *    xl: implement and enable dryrun mode for `xl vcpu-pin'
 *    xl: test script for the cpumap parser (for vCPU pinning)
      xen: fix leaking of v->cpu_affinity_saved
      xen: sched: make space for cpu_soft_affinity
      xen: sched: rename v->cpu_affinity into v->cpu_hard_affinity
      xen: derive NUMA node affinity from hard and soft CPU affinity
      xen: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
      xen: sched: use soft-affinity instead of domain's node-affinity
      libxc: get and set soft and hard affinity
      libxl: get and set soft affinity
      xl: show soft affinity in `xl vcpu-list'
      xl: enable setting soft affinity
      xl: enable for specifying node-affinity in the config file
      libxl: automatic NUMA placement affects soft affinity

 docs/man/xl.cfg.pod.5                           |   66 +++
 docs/man/xl.pod.1                               |   34 ++
 docs/misc/xl-numa-placement.markdown            |  164 ++++++---
 tools/libxc/xc_domain.c                         |  153 +++++++-
 tools/libxc/xenctrl.h                           |   53 +++
 tools/libxl/Makefile                            |    1
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 +++
 tools/libxl/libxl.c                             |  206 ++++++++++-
 tools/libxl/libxl.h                             |   23 +
 tools/libxl/libxl_create.c                      |    6
 tools/libxl/libxl_dom.c                         |   22 +
 tools/libxl/libxl_types.idl                     |    4
 tools/libxl/libxl_utils.h                       |   13 +
 tools/libxl/xl_cmdimpl.c                        |  439 ++++++++++++++++-------
 tools/libxl/xl_cmdtable.c                       |    8
 xen/arch/x86/traps.c                            |   13 -
 xen/common/domain.c                             |   48 ++-
 xen/common/domctl.c                             |   38 ++
 xen/common/keyhandler.c                         |    4
 xen/common/sched_credit.c                       |  161 +++-----
 xen/common/sched_sedf.c                         |    2
 xen/common/schedule.c                           |   56 ++-
 xen/common/wait.c                               |   10 -
 xen/include/public/domctl.h                     |   15 +
 xen/include/xen/sched.h                         |   14 -
 26 files changed, 1515 insertions(+), 385 deletions(-)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
in fact, pinning to all the pcpus happens by specifying "all"
(either on the command line or in the config file), while `xl
vcpu-list' reports it as "any cpu".

Change this into something more consistent, by using "all"
everywhere.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes since v1:
 * this patch was not there in v1. It is here now because using
   the same syntax for both input and output was requested
   during review.
---
 tools/libxl/xl_cmdimpl.c |   27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 8690ec7..13e97b3 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3101,8 +3101,7 @@ out:
     }
 }
 
-/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
-static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
+static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
 {
     int i;
     uint8_t pmap = 0, bitmask = 0;
@@ -3140,28 +3139,16 @@ static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
     case 2:
         break;
     case 1:
-        if (firstset == 0)
-            return 1;
+        if (firstset == 0) {
+            fprintf(stream, "all");
+            break;
+        }
     case 3:
         fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
         if (i - 1 > firstset)
             fprintf(stream, "-%d", i - 1);
         break;
     }
-
-    return 0;
-}
-
-static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
-{
-    if (print_bitmap(map, maplen, stream))
-        fprintf(stream, "any cpu");
-}
-
-static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
-{
-    if (print_bitmap(map, maplen, stream))
-        fprintf(stream, "any node");
 }
 
 static void list_domains(int verbose, int context, int claim, int numa,
@@ -3234,7 +3221,7 @@ static void list_domains(int verbose, int context, int claim, int numa,
             libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
 
             putchar(' ');
-            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
+            print_bitmap(nodemap.map, physinfo.nr_nodes, stdout);
         }
         putchar('\n');
     }
@@ -4446,7 +4433,7 @@ static void print_vcpuinfo(uint32_t tdomid,
     /* TIM */
     printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9));
     /* CPU AFFINITY */
-    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
+    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
     printf("\n");
 }
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
   but not pCPU 6;
 * ...

In both domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1 (of this very series):
 * actually checking for both "nodes:" and "node:", as the doc
   says;
 * using strcmp() (rather than strncmp()) when matching "all",
   to avoid returning success on any longer string that just
   begins with "all";
 * fixing the handling (well, the rejection, actually) of
   "^all" and "^nodes:all";
 * make some string pointers const.

Changes from v2 (of original series):
 * turned a 'return' into 'goto out', consistently with most
   of the exit patterns;
 * harmonized error handling: now parse_range() returns a
   libxl error code, as requested during review;
 * dealing with "all" moved inside update_cpumap_range().
   It's tricky to move it in parse_range() (as requested
   during review), since we need the cpumap being modified
   handy when dealing with it. However, having it in
   update_cpumap_range() simplifies the code just as much
   as that;
 * explicitly checking for junk after a valid value or range
   in parse_range(), as requested during review;
 * xl exits on parsing failing, so no need to reset the
   cpumap to something sensible in vcpupin_parse(), as
   suggested during review;

Changes from v1 (of original series):
 * code rearranged in order to be simpler to follow and
   understand, as requested during review;
 * improved docs in xl.cfg.pod.5, as requested during
   review;
 * strtoul() now returns into unsigned long, and the
   case where it returns ULONG_MAX is now taken into
   account, as requested during review;
 * stuff like "all,^7" now works, as requested during
   review.
Specifying just "^7" does not work either before or after this change * killed some magic (i.e., `ptr += 5 + (ptr[4] == ''s''`) by introducing STR_SKIP_PREFIX() macro, as requested during review. --- docs/man/xl.cfg.pod.5 | 20 ++++++ tools/libxl/xl_cmdimpl.c | 153 +++++++++++++++++++++++++++++++++------------- 2 files changed, 128 insertions(+), 45 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index e6fc83f..5dbc73c 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host. =item "0-3,5,^1" -To allow all the vcpus of the guest to run on cpus 0,2,3,5. +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining +this with "all" is possible, meaning "all,^7" results in all the vcpus +of the guest running on all the cpus on the host except cpu 7. + +=item "nodes:0-3,node:^2" + +To allow all the vcpus of the guest to run on the cpus from NUMA nodes +0,1,3 of the host. So, if cpus 0-3 belongs to node 0, cpus 4-7 belongs +to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus +of the guest will run on cpus 0-3,8-11. + +Combining this notation with the one above is possible. For instance, +"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and +on all the cpus of NUMA node 2, but not on cpu 6. Following the same +example as above, that would be cpus 1,4,5,7. + +Combining this with "all" is also possible, meaning "all,^nodes:1" +results in all the vcpus of the guest running on all the cpus on the +host, except for the cpus belonging to the host NUMA node 1. =item ["2", "3"] (or [2, 3]) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 13e97b3..5f5cc43 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -59,6 +59,11 @@ } \ }) +#define STR_HAS_PREFIX( a, b ) \ + ( strncmp(a, b, strlen(b)) == 0 ) +#define STR_SKIP_PREFIX( a, b ) \ + ( STR_HAS_PREFIX(a, b) ? 
((a) += strlen(b), 1) : 0 ) + int logfile = 2; @@ -513,61 +518,121 @@ static void split_string_into_string_list(const char *str, free(s); } -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +static int parse_range(const char *str, unsigned long *a, unsigned long *b) { - libxl_bitmap exclude_cpumap; - uint32_t cpuida, cpuidb; - char *endptr, *toka, *tokb, *saveptr = NULL; - int i, rc = 0, rmcpu; + const char *nstr; + char *endptr; - if (!strcmp(cpu, "all")) { - libxl_bitmap_set_any(cpumap); - return 0; + *a = *b = strtoul(str, &endptr, 10); + if (endptr == str || *a == ULONG_MAX) + return ERROR_INVAL; + + if (*endptr == ''-'') { + nstr = endptr + 1; + + *b = strtoul(nstr, &endptr, 10); + if (endptr == nstr || *b == ULONG_MAX || *b < *a) + return ERROR_INVAL; + } + + /* Valid value or range so far, but we also don''t want junk after that */ + if (*endptr != ''\0'') + return ERROR_INVAL; + + return 0; +} + +/* + * Add or removes a specific set of cpus (specified in str, either as + * single cpus or as entire NUMA nodes) to/from cpumap. + */ +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) +{ + unsigned long ida, idb; + libxl_bitmap node_cpumap; + bool is_not = false, is_nodes = false; + int rc = 0; + + libxl_bitmap_init(&node_cpumap); + + rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0); + if (rc) { + fprintf(stderr, "libxl_node_bitmap_alloc failed.\n"); + goto out; } - if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) { - fprintf(stderr, "Error: Failed to allocate cpumap.\n"); - return ENOMEM; + /* Are we adding or removing cpus/nodes? 
*/ + if (STR_SKIP_PREFIX(str, "^")) { + is_not = true; } - for (toka = strtok_r(cpu, ",", &saveptr); toka; - toka = strtok_r(NULL, ",", &saveptr)) { - rmcpu = 0; - if (*toka == ''^'') { - /* This (These) Cpu(s) will be removed from the map */ - toka++; - rmcpu = 1; - } - /* Extract a valid (range of) cpu(s) */ - cpuida = cpuidb = strtoul(toka, &endptr, 10); - if (endptr == toka) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; - } - if (*endptr == ''-'') { - tokb = endptr + 1; - cpuidb = strtoul(tokb, &endptr, 10); - if (endptr == tokb || cpuida > cpuidb) { - fprintf(stderr, "Error: Invalid argument.\n"); - rc = EINVAL; - goto vcpp_out; + /* Are we dealing with cpus or full nodes? */ + if (STR_SKIP_PREFIX(str, "node:") || STR_SKIP_PREFIX(str, "nodes:")) { + is_nodes = true; + } + + if (strcmp(str, "all") == 0) { + /* We do not accept "^all" or "^nodes:all" */ + if (is_not) { + fprintf(stderr, "Can''t combine \"^\" and \"all\".\n"); + rc = ERROR_INVAL; + } else + libxl_bitmap_set_any(cpumap); + goto out; + } + + rc = parse_range(str, &ida, &idb); + if (rc) { + fprintf(stderr, "Invalid pcpu range: %s.\n", str); + goto out; + } + + /* Add or remove the specified cpus in the range */ + while (ida <= idb) { + if (is_nodes) { + /* Add/Remove all the cpus of a NUMA node */ + int i; + + rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap); + if (rc) { + fprintf(stderr, "libxl_node_to_cpumap failed.\n"); + goto out; } + + /* Add/Remove all the cpus in the node cpumap */ + libxl_for_each_set_bit(i, node_cpumap) { + is_not ? libxl_bitmap_reset(cpumap, i) : + libxl_bitmap_set(cpumap, i); + } + } else { + /* Add/Remove this cpu */ + is_not ? libxl_bitmap_reset(cpumap, ida) : + libxl_bitmap_set(cpumap, ida); } - while (cpuida <= cpuidb) { - rmcpu == 0 ? 
libxl_bitmap_set(cpumap, cpuida) : - libxl_bitmap_set(&exclude_cpumap, cpuida); - cpuida++; - } + ida++; } - /* Clear all the cpus from the removal list */ - libxl_for_each_set_bit(i, exclude_cpumap) { - libxl_bitmap_reset(cpumap, i); - } + out: + libxl_bitmap_dispose(&node_cpumap); + return rc; +} -vcpp_out: - libxl_bitmap_dispose(&exclude_cpumap); +/* + * Takes a string representing a set of cpus (specified either as + * single cpus or as eintire NUMA nodes) and turns it into the + * corresponding libxl_bitmap (in cpumap). + */ +static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +{ + char *ptr, *saveptr = NULL; + int rc = 0; + + for (ptr = strtok_r(cpu, ",", &saveptr); ptr; + ptr = strtok_r(NULL, ",", &saveptr)) { + rc = update_cpumap_range(ptr, cpumap); + if (rc) + break; + } return rc; }
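[Editor's illustration, not part of the patch.] The pinning syntax described
in this patch ("0-3,5,^1", "all,^7", "nodes:1,^9") can be tried out with a
small standalone C sketch. Everything here is an assumption made for the
example: the host has 16 pcpus split evenly over 2 NUMA nodes, the cpumap is
a plain 64-bit mask instead of a libxl_bitmap, errors are reported as -1
rather than as libxl error codes, and `node_to_cpumask()`, `apply_token()`
and `parse_cpumap()` are hypothetical names. Unlike the patch, the sketch
also rejects ids beyond the assumed topology.

```c
#define _POSIX_C_SOURCE 200809L   /* for strtok_r() */
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NR_CPUS  16   /* assumed host: 16 pcpus ... */
#define NR_NODES 2    /* ... split evenly across 2 NUMA nodes */

/* Assumed topology: node n owns pcpus n*8 .. n*8+7. */
static uint64_t node_to_cpumask(unsigned long node)
{
    unsigned per_node = NR_CPUS / NR_NODES;
    return (((uint64_t)1 << per_node) - 1) << (node * per_node);
}

/* Parse "N" or "N-M" (M >= N), rejecting trailing junk. 0 on success. */
static int parse_range(const char *s, unsigned long *a, unsigned long *b)
{
    char *end;

    *a = *b = strtoul(s, &end, 10);
    if (end == s || *a == ULONG_MAX)
        return -1;
    if (*end == '-') {
        const char *n = end + 1;
        *b = strtoul(n, &end, 10);
        if (end == n || *b == ULONG_MAX || *b < *a)
            return -1;
    }
    return *end == '\0' ? 0 : -1;
}

/* Apply one comma-separated token to *map. 0 on success. */
static int apply_token(const char *tok, uint64_t *map)
{
    int negate = 0, nodes = 0;
    unsigned long a, b, limit;

    if (*tok == '^') { negate = 1; tok++; }
    if (strncmp(tok, "nodes:", 6) == 0) { nodes = 1; tok += 6; }
    else if (strncmp(tok, "node:", 5) == 0) { nodes = 1; tok += 5; }

    if (strcmp(tok, "all") == 0) {
        if (negate)
            return -1;                    /* "^all" is rejected */
        *map = ((uint64_t)1 << NR_CPUS) - 1;
        return 0;
    }
    limit = nodes ? NR_NODES : NR_CPUS;
    if (parse_range(tok, &a, &b) || b >= limit)
        return -1;
    for (; a <= b; a++) {
        uint64_t bits = nodes ? node_to_cpumask(a) : (uint64_t)1 << a;
        *map = negate ? (*map & ~bits) : (*map | bits);
    }
    return 0;
}

/* Parse a full spec left to right; spec is modified by strtok_r(). */
static int parse_cpumap(char *spec, uint64_t *map)
{
    char *save = NULL;

    assert(spec != NULL && map != NULL);
    *map = 0;
    for (char *t = strtok_r(spec, ",", &save); t;
         t = strtok_r(NULL, ",", &save))
        if (apply_token(t, map))
            return -1;
    return 0;
}
```

Because tokens are applied in order, a later token can undo an earlier one,
which is why "all,^7" works while a lone "^7" just clears a bit in an empty
map.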
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 03/16] xl: implement and enable dryrun mode for `xl vcpu-pin''
As it can be useful to see if the outcome of some complex vCPU pinning bitmap specification looks as expected. This also allow for the introduction of some automatic testing and verification for the bitmap parsing code, as it happens already in check-xl-disk-parse and check-xl-vif-parse. In particular, to make the above possible, this commit also changes the implementation of the vcpu-pin command so that, instead of always returning 0, it returns an error if the parsing fails. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com> --- Changes since v2 (of this very series): * fixed a typo in the changelog --- tools/libxl/xl_cmdimpl.c | 49 +++++++++++++++++++++++++++++++++------------ tools/libxl/xl_cmdtable.c | 2 +- 2 files changed, 37 insertions(+), 14 deletions(-) diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 5f5cc43..cf237c4 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4566,30 +4566,53 @@ int main_vcpulist(int argc, char **argv) return 0; } -static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) +static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) { libxl_vcpuinfo *vcpuinfo; libxl_bitmap cpumap; uint32_t vcpuid; char *endptr; - int i, nb_vcpu; + int i = 0, nb_vcpu, rc = -1; + + libxl_bitmap_init(&cpumap); vcpuid = strtoul(vcpu, &endptr, 10); if (vcpu == endptr) { if (strcmp(vcpu, "all")) { fprintf(stderr, "Error: Invalid argument.\n"); - return; + goto out; } vcpuid = -1; } - if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) { - goto vcpupin_out; - } + if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) + goto out; if (vcpupin_parse(cpu, &cpumap)) - goto vcpupin_out1; + goto out; + + if (dryrun_only) { + libxl_cputopology *info = libxl_get_cpu_topology(ctx, &i); + + if (!info) { + fprintf(stderr, "libxl_get_cpu_topology failed.\n"); + goto out; + } + libxl_cputopology_list_free(info, i); + + fprintf(stdout, "cpumap: "); + print_bitmap(cpumap.map, 
i, stdout); + fprintf(stdout, "\n"); + + if (ferror(stdout) || fflush(stdout)) { + perror("stdout"); + exit(-1); + } + + rc = 0; + goto out; + } if (vcpuid != -1) { if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) { @@ -4599,7 +4622,7 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) else { if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) { fprintf(stderr, "libxl_list_vcpu failed.\n"); - goto vcpupin_out1; + goto out; } for (i = 0; i < nb_vcpu; i++) { if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid, @@ -4610,10 +4633,11 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu) } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } - vcpupin_out1: + + rc = 0; + out: libxl_bitmap_dispose(&cpumap); - vcpupin_out: - ; + return rc; } int main_vcpupin(int argc, char **argv) @@ -4624,8 +4648,7 @@ int main_vcpupin(int argc, char **argv) /* No options */ } - vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); - return 0; + return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); } static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index 326a660..d3dcbf0 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -211,7 +211,7 @@ struct cmd_spec cmd_table[] = { "[Domain, ...]", }, { "vcpu-pin", - &main_vcpupin, 0, 1, + &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", },
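[Editor's illustration, not part of the patch.] The control flow this patch
gives to vcpu-pin (parse first, report a parse failure to the caller, and in
dry-run mode print the resulting map without applying it) can be sketched as
below. All names are hypothetical: `parse_spec()` is a toy parser that only
accepts a single pcpu number, `apply_affinity()` is a stub standing in for
the state-changing call, and the `applied` counter exists only so the
behaviour can be observed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static int applied;   /* counts calls to the stub state-changing step */

/* Stub standing in for the call that would really change the affinity. */
static int apply_affinity(uint64_t map)
{
    (void)map;
    applied++;
    return 0;
}

/* Toy parser: accepts a single pcpu number below 64. 0 on success. */
static int parse_spec(const char *spec, uint64_t *map)
{
    char *end;
    unsigned long cpu = strtoul(spec, &end, 10);

    if (end == spec || *end != '\0' || cpu >= 64)
        return -1;
    *map = (uint64_t)1 << cpu;
    return 0;
}

/* Returns 0 on success and non-zero on failure, in both modes. */
static int do_pin(const char *spec, int dryrun, FILE *out)
{
    uint64_t map;

    assert(spec != NULL && out != NULL);
    if (parse_spec(spec, &map))
        return 1;                 /* parse errors reach the caller */
    if (dryrun) {
        fprintf(out, "cpumap: 0x%llx\n", (unsigned long long)map);
        return 0;                 /* nothing is applied in dry-run mode */
    }
    return apply_affinity(map) ? 1 : 0;
}
```

Returning the parse result in both modes is what lets a test script drive
the parser through dry runs and check exit codes.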
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 04/16] xl: test script for the cpumap parser (for vCPU pinning)
This commit introduces "check-xl-vcpupin-parse" to help verify and
debug the (v)CPU bitmap parsing code in xl.

The script runs "xl -N vcpu-pin 0 all <some strings>" repeatedly, with
various input strings, and checks that the output is as expected.

This is what the script can do:

# ./check-xl-vcpupin-parse -h
 usage: ./check-xl-vcpupin-parse [options]

 Tests various vcpu-pinning strings. If run without arguments acts
 as follows:
  - generates some test data and saves them in
    check-xl-vcpupin-parse.data;
  - tests all the generated configurations (reading them back from
    check-xl-vcpupin-parse.data).

 An example of a test vector file is provided in
 check-xl-vcpupin-parse.data-example.

 Options:
  -h         prints this message
  -r seed    uses seed for initializing the rundom number generator
             (default: the script PID)
  -s string  tries using string as a vcpu pinning configuration and
             reports whether that succeeds or not
  -o ofile   save the test data in ofile
             (default: check-xl-vcpupin-parse.data)
  -i ifile   read test data from ifile

An example test data file (generated on a host with 2 NUMA nodes and
16 CPUs) is provided in check-xl-vcpupin-parse.data-example.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v2 (of original series):
 * killed the `sleep 1', as requested during review;
 * allow for passing a custom random seed, and report what the
   actual random seed used is, as requested during review;
 * allow for testing specific pinning configuration strings,
   as suggested during review;
 * store the test data in a file, after generating them, and
   read them back from there for actual testing, as suggested
   during review;
 * allow for reading the test data from an existing test file
   instead of always generating new ones.

Changes from v1 (of original series):
 * this was not there in v1, and adding it has been requested
   during review.
--- tools/libxl/check-xl-vcpupin-parse | 294 +++++++++++++++++++++++ tools/libxl/check-xl-vcpupin-parse.data-example | 53 ++++ 2 files changed, 347 insertions(+) create mode 100755 tools/libxl/check-xl-vcpupin-parse create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example diff --git a/tools/libxl/check-xl-vcpupin-parse b/tools/libxl/check-xl-vcpupin-parse new file mode 100755 index 0000000..21f8421 --- /dev/null +++ b/tools/libxl/check-xl-vcpupin-parse @@ -0,0 +1,294 @@ +#!/bin/bash + +set -e + +if [ -x ./xl ] ; then + export LD_LIBRARY_PATH=.:../libxc:../xenstore: + XL=./xl +else + XL=xl +fi + +fprefix=tmp.check-xl-vcpupin-parse +outfile=check-xl-vcpupin-parse.data + +usage () { +cat <<END +usage: $0 [options] + +Tests various vcpu-pinning strings. If run without arguments acts +as follows: + - generates some test data and saves them in $outfile; + - tests all the generated configurations (reading them back from + $outfile). + +An example of a test vector file is provided in ${outfile}-example. + +Options: + -h prints this message + -r seed uses seed for initializing the rundom number generator + (default: the script PID) + -s string tries using string as a vcpu pinning configuration and + reports whether that succeeds or not + -o ofile save the test data in ofile (default: $outfile) + -i ifile read test data from ifile +END +} + +expected () { + cat >$fprefix.expected +} + +# by default, re-seed with our PID +seed=$$ +failures=0 + +# Execute one test and check the result against the provided +# rc value and output +one () { + expected_rc=$1; shift + printf "test case %s...\n" "$*" + set +e + ${XL} -N vcpu-pin 0 all "$@" </dev/null >$fprefix.actual 2>/dev/null + actual_rc=$? + if [ $actual_rc != $expected_rc ]; then + diff -u $fprefix.expected $fprefix.actual + echo >&2 "test case \`$*'' failed ($actual_rc $diff_rc)" + failures=$(( $failures + 1 )) + fi + set -e +} + +# Write an entry in the test vector file. 
Format is as follows: +# test-string*expected-rc*expected-output +write () { + printf "$1*$2*$3\n" >> $outfile +} + +complete () { + if [ "$failures" = 0 ]; then + echo all ok.; exit 0 + else + echo "$failures tests failed."; exit 1 + fi +} + +# Test a specific pinning string +string () { + expected_rc=$1; shift + printf "test case %s...\n" "$*" + set +e + ${XL} -N vcpu-pin 0 all "$@" &> /dev/null + actual_rc=$? + set -e + + if [ $actual_rc != $expected_rc ]; then + echo >&2 "test case \`$*'' failed ($actual_rc)" + else + echo >&2 "test case \`$*'' succeeded" + fi + + exit 0 +} + +# Read a test vector file (provided as $1) line by line and +# test all the entries it contains +run () +{ + while read line + do + if [ ${line:0:1} != ''#'' ]; then + test_string="`echo $line | cut -f1 -d''*''`" + exp_rc="`echo $line | cut -f2 -d''*''`" + exp_output="`echo $line | cut -f3 -d''*''`" + + expected <<END +$exp_output +END + one $exp_rc "$test_string" + fi + done < $1 + + complete + + exit 0 +} + +while getopts "hr:s:o:i:" option +do + case $option in + h) + usage + exit 0 + ;; + r) + seed=$OPTARG + ;; + s) + string 0 "$OPTARG" + ;; + o) + outfile=$OPTARG + ;; + i) + run $OPTARG + ;; + esac +done + +#---------- test data ---------- +# +nr_cpus=`xl info | grep nr_cpus | cut -f2 -d'':''` +nr_nodes=`xl info | grep nr_nodes | cut -f2 -d'':''` +nr_cpus_per_node=`xl info -n | sed ''/cpu:/,/numa_info/!d'' | head -n -1 | \ + awk ''{print $4}'' | uniq -c | tail -1 | awk ''{print $1}''` +cat >$outfile <<END +# WARNING: some of these tests are topology based tests. +# Expect failures if the topology is not detected correctly +# detected topology: $nr_cpus CPUs, $nr_nodes nodes, $nr_cpus_per_node CPUs per node. +# +# seed used for random number generation: seed=${seed}. 
+# +# Format is as follows: +# test-string*expected-return-code*expected-output +# +END + +# Re-seed the random number generator +RANDOM=$seed + +echo "# Testing a wrong configuration" >> $outfile +write foo 255 "" + +echo "# Testing the ''all'' syntax" >> $outfile +write "all" 0 "cpumap: all" +write "nodes:all" 0 "cpumap: all" +write "all,nodes:all" 0 "cpumap: all" +write "all,^nodes:0,all" 0 "cpumap: all" + +echo "# Testing the empty cpumap case" >> $outfile +write "^0" 0 "cpumap: none" + +echo "# A few attempts of pinning to just one random cpu" >> $outfile +if [ $nr_cpus -gt 1 ]; then + for i in `seq 0 3`; do + cpu=$(($RANDOM % nr_cpus)) + write "$cpu" 0 "cpumap: $cpu" + done +fi + +echo "# A few attempts of pinning to all but one random cpu" >> $outfile +if [ $nr_cpus -gt 2 ]; then + for i in `seq 0 3`; do + cpu=$(($RANDOM % nr_cpus)) + if [ $cpu -eq 0 ]; then + expected_range="1-$((nr_cpus - 1))" + elif [ $cpu -eq 1 ]; then + expected_range="0,2-$((nr_cpus - 1))" + elif [ $cpu -eq $((nr_cpus - 2)) ]; then + expected_range="0-$((cpu - 1)),$((nr_cpus - 1))" + elif [ $cpu -eq $((nr_cpus - 1)) ]; then + expected_range="0-$((nr_cpus - 2))" + else + expected_range="0-$((cpu - 1)),$((cpu + 1))-$((nr_cpus - 1))" + fi + write "all,^$cpu" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a random range of cpus" >> $outfile +if [ $nr_cpus -gt 2 ]; then + for i in `seq 0 3`; do + cpua=$(($RANDOM % nr_cpus)) + range=$((nr_cpus - cpua)) + cpub=$(($RANDOM % range)) + cpubb=$((cpua + cpub)) + if [ $cpua -eq $cpubb ]; then + expected_range="$cpua" + else + expected_range="$cpua-$cpubb" + fi + write "$expected_range" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to just one random node" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), 
the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + write "nodes:$node" 0 "cpumap: $((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1)-1))" + done +fi + +echo "# A few attempts of pinning to all but one random node" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + if [ $node -eq 0 ]; then + expected_range="$nr_cpus_per_node-$((nr_cpus - 1))" + elif [ $node -eq $((nr_nodes - 1)) ]; then + expected_range="0-$((nr_cpus - nr_cpus_per_node - 1))" + else + expected_range="0-$((nr_cpus_per_node*node-1)),$((nr_cpus_per_node*(node+1)))-$nr_cpus" + fi + write "all,^nodes:$node" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a random range of nodes" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + nodea=$(($RANDOM % nr_nodes)) + range=$((nr_nodes - nodea)) + nodeb=$(($RANDOM % range)) + nodebb=$((nodea + nodeb)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. 
+ if [ $nodea -eq 0 ] && [ $nodebb -eq $((nr_nodes - 1)) ]; then + expected_range="all" + else + expected_range="$((nr_cpus_per_node*nodea))-$((nr_cpus_per_node*(nodebb+1) - 1))" + fi + write "nodes:$nodea-$nodebb" 0 "cpumap: $expected_range" + done +fi + +echo "# A few attempts of pinning to a node but excluding one random cpu" >> $outfile +if [ $nr_nodes -gt 1 ]; then + for i in `seq 0 3`; do + node=$(($RANDOM % nr_nodes)) + # this assumes that the first $nr_cpus_per_node (from cpu + # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node + # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node + # to 2*$nr_cpus_per_node-1) are assigned to the second node (node + # 1), etc. Expect failures if that is not the case. + cpu=$(($RANDOM % nr_cpus_per_node + nr_cpus_per_node*node)) + if [ $cpu -eq $((nr_cpus_per_node*node)) ]; then + expected_range="$((nr_cpus_per_node*node + 1))-$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*node + 1)) ]; then + expected_range="$((nr_cpus_per_node*node)),$((nr_cpus_per_node*node + 2))-$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 2)) ]; then + expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 3)),$((nr_cpus_per_node*(node+1) - 1))" + elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 1)) ]; then + expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 2))" + else + expected_range="$((nr_cpus_per_node*node))-$((cpu - 1)),$((cpu + 1))-$((nr_cpus_per_node*(node+1) - 1))" + fi + write "nodes:$node,^$cpu" 0 "cpumap: $expected_range" + done +fi + +run $outfile diff --git a/tools/libxl/check-xl-vcpupin-parse.data-example b/tools/libxl/check-xl-vcpupin-parse.data-example new file mode 100644 index 0000000..4bbd5de --- /dev/null +++ b/tools/libxl/check-xl-vcpupin-parse.data-example @@ -0,0 +1,53 @@ +# WARNING: some of these tests are topology based tests. 
+# Expect failures if the topology is not detected correctly +# detected topology: 16 CPUs, 2 nodes, 8 CPUs per node. +# +# seed used for random number generation: seed=13328. +# +# Format is as follows: +# test-string*expected-return-code*expected-output +# +# Testing a wrong configuration +foo*255* +# Testing the ''all'' syntax +all*0*cpumap: all +nodes:all*0*cpumap: all +all,nodes:all*0*cpumap: all +all,^nodes:0,all*0*cpumap: all +# Testing the empty cpumap case +^0*0*cpumap: none +# A few attempts of pinning to just one random cpu +0*0*cpumap: 0 +9*0*cpumap: 9 +6*0*cpumap: 6 +0*0*cpumap: 0 +# A few attempts of pinning to all but one random cpu +all,^12*0*cpumap: 0-11,13-15 +all,^6*0*cpumap: 0-5,7-15 +all,^3*0*cpumap: 0-2,4-15 +all,^7*0*cpumap: 0-6,8-15 +# A few attempts of pinning to a random range of cpus +13-15*0*cpumap: 13-15 +7*0*cpumap: 7 +3-5*0*cpumap: 3-5 +8-11*0*cpumap: 8-11 +# A few attempts of pinning to just one random node +nodes:1*0*cpumap: 8-15 +nodes:0*0*cpumap: 0-7 +nodes:0*0*cpumap: 0-7 +nodes:0*0*cpumap: 0-7 +# A few attempts of pinning to all but one random node +all,^nodes:0*0*cpumap: 8-15 +all,^nodes:1*0*cpumap: 0-7 +all,^nodes:1*0*cpumap: 0-7 +all,^nodes:0*0*cpumap: 8-15 +# A few attempts of pinning to a random range of nodes +nodes:1-1*0*cpumap: 8-15 +nodes:1-1*0*cpumap: 8-15 +nodes:0-1*0*cpumap: all +nodes:0-0*0*cpumap: 0-7 +# A few attempts of pinning to a node but excluding one random cpu +nodes:1,^8*0*cpumap: 9-15 +nodes:0,^6*0*cpumap: 0-5,7 +nodes:1,^9*0*cpumap: 8,10-15 +nodes:0,^5*0*cpumap: 0-4,6-7
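[Editor's illustration, not part of the patch.] The test vector format used
by the data file above, "test-string*expected-return-code*expected-output"
with '#' starting a comment line, is simple enough to parse in a few lines of
C. This is only a sketch of the format: the script itself does the splitting
in bash with cut, and `struct test_vector` and `parse_vector()` are
hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One entry of the test vector file; field names are hypothetical. */
struct test_vector {
    char *input;      /* pinning string passed to xl */
    int expected_rc;  /* exit code xl is expected to return */
    char *output;     /* expected stdout, possibly empty */
};

/*
 * Split "test-string*expected-rc*expected-output" in place.
 * Returns 0 on success, 1 for a comment or blank line to skip,
 * and -1 if the line is malformed.
 */
static int parse_vector(char *line, struct test_vector *tv)
{
    char *rc, *out, *end;

    assert(line != NULL && tv != NULL);
    line[strcspn(line, "\n")] = '\0';      /* drop trailing newline */
    if (line[0] == '#' || line[0] == '\0')
        return 1;

    rc = strchr(line, '*');                /* first separator */
    if (!rc)
        return -1;
    *rc++ = '\0';
    out = strchr(rc, '*');                 /* second separator */
    if (!out)
        return -1;
    *out++ = '\0';

    tv->expected_rc = (int)strtol(rc, &end, 10);
    if (end == rc || *end != '\0')
        return -1;
    tv->input = line;
    tv->output = out;
    return 0;
}
```

Note that the separator is '*', so a test string containing a literal '*'
cannot be represented in this format.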
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 05/16] xen: fix leaking of v->cpu_affinity_saved
on domain destruction.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 xen/common/domain.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 1162e55..2cbc489 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -736,6 +736,7 @@ static void complete_domain_destroy(struct rcu_head *head)
         {
             free_cpumask_var(v->cpu_affinity);
             free_cpumask_var(v->cpu_affinity_tmp);
+            free_cpumask_var(v->cpu_affinity_saved);
             free_cpumask_var(v->vcpu_dirty_cpumask);
             free_vcpu_struct(v);
         }
Dario Faggioli
2013-Nov-13 19:11 UTC
[PATCH v2 06/16] xen: sched: make space for cpu_soft_affinity
Before this change, each vcpu had its own vcpu-affinity (in v->cpu_affinity), representing the set of pcpus where the vcpu is allowed to run. Since NUMA-aware scheduling was introduced, the scheduler (credit1 only, for now) also tries as hard as it can to run all the vcpus of a domain on one of the nodes that constitute the domain's node-affinity.

The idea here is to make the mechanism more general by:
 * allowing this 'preference' for some pcpus/nodes to be expressed on a per-vcpu basis, rather than for the domain as a whole. That is to say, each vcpu should have its own set of preferred pcpus/nodes, rather than the set being the very same for all the vcpus of the domain;
 * generalizing the idea of 'preferred pcpus' beyond NUMA awareness and support. That is to say, regardless of whether it is (mostly) useful on NUMA systems, it should be possible to specify, for each vcpu, a set of pcpus where it prefers to run (in addition to, and possibly unrelated to, the set of pcpus where it is allowed to run).

We will call this set of *preferred* pcpus the vcpu's soft affinity. This change introduces, allocates, frees and initializes the data structure required to host it in struct vcpu (cpu_soft_affinity).

The new field is not used anywhere yet, so there is no real functional change yet.
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- Changes from v1: * this patch does something similar to what, in v1, was being done in "5/12 xen: numa-sched: make space for per-vcpu node-affinity" --- xen/common/domain.c | 3 +++ xen/common/keyhandler.c | 2 ++ xen/common/schedule.c | 2 ++ xen/include/xen/sched.h | 3 +++ 4 files changed, 10 insertions(+) diff --git a/xen/common/domain.c b/xen/common/domain.c index 2cbc489..c33b876 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( if ( !zalloc_cpumask_var(&v->cpu_affinity) || !zalloc_cpumask_var(&v->cpu_affinity_tmp) || !zalloc_cpumask_var(&v->cpu_affinity_saved) || + !zalloc_cpumask_var(&v->cpu_soft_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); return NULL; @@ -737,6 +739,7 @@ static void complete_domain_destroy(struct rcu_head *head) free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); } diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index 8e4b3f8..33c9a37 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -298,6 +298,8 @@ static void dump_domains(unsigned char key) printk("dirty_cpus=%s ", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); printk("cpu_affinity=%s\n", tmpstr); + cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity); + printk("cpu_soft_affinity=%s\n", tmpstr); printk(" pause_count=%d pause_flags=%lx\n", atomic_read(&v->pause_count), v->pause_flags); arch_dump_vcpu_info(v); diff --git a/xen/common/schedule.c b/xen/common/schedule.c 
index 0f45f07..5731622 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -198,6 +198,8 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) else cpumask_setall(v->cpu_affinity); + cpumask_setall(v->cpu_soft_affinity); + /* Initialise the per-vcpu timers. */ init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, v->processor); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index cbdf377..7e00caf 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -198,6 +198,9 @@ struct vcpu /* Used to restore affinity across S3. */ cpumask_var_t cpu_affinity_saved; + /* Bitmask of CPUs on which this VCPU prefers to run. */ + cpumask_var_t cpu_soft_affinity; + /* Bitmask of CPUs which are holding onto this VCPU''s state. */ cpumask_var_t vcpu_dirty_cpumask;
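To make the intended semantics of the new field a bit more concrete, here is a toy model of "prefers to run" versus "is allowed to run". It is only a sketch under stated assumptions: toy_cpumask_t is a plain 64 bit word standing in for cpumask_t, toy_pick_cpu() is a made-up helper, and this is not how credit1 actually picks a pcpu (nothing in this patch uses cpu_soft_affinity yet).

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for cpumask_t: bit i set means pcpu i is in the set. */
typedef uint64_t toy_cpumask_t;

/*
 * Pick a pcpu for a vcpu among the idle ones:
 *  1. try idle pcpus that are in both the hard and the soft affinity;
 *  2. otherwise fall back to idle pcpus in the hard affinity only.
 * Returns the lowest suitable pcpu, or -1 if none is available.
 */
static int toy_pick_cpu(toy_cpumask_t hard, toy_cpumask_t soft,
                        toy_cpumask_t idle)
{
    toy_cpumask_t candidates = hard & soft & idle;
    int cpu = 0;

    /* Soft affinity is only a preference: hard affinity always wins. */
    if (candidates == 0)
        candidates = hard & idle;
    if (candidates == 0)
        return -1;

    /* The loop below terminates because at least one bit is set. */
    assert(candidates != 0);
    while (!(candidates & 1)) {
        candidates >>= 1;
        cpu++;
    }

    return cpu;
}
```

With hard = 0xFF, soft = 0xF0 and every pcpu idle, the sketch picks pcpu 4. If only pcpus 0-3 are idle, it still runs the vcpu, on pcpu 0, instead of waiting for a preferred pcpu.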
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 07/16] xen: sched: rename v->cpu_affinity into v->cpu_hard_affinity
in order to distinguish it from the cpu_soft_affinity introduced by the previous commit ("xen: sched: make space for cpu_soft_affinity"). This patch does not imply any functional change, it is basically the result of something like the following: s/cpu_affinity/cpu_hard_affinity/g s/cpu_affinity_tmp/cpu_hard_affinity_tmp/g s/cpu_affinity_saved/cpu_hard_affinity_saved/g Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/arch/x86/traps.c | 11 ++++++----- xen/common/domain.c | 22 +++++++++++----------- xen/common/domctl.c | 2 +- xen/common/keyhandler.c | 2 +- xen/common/sched_credit.c | 12 ++++++------ xen/common/sched_sedf.c | 2 +- xen/common/schedule.c | 21 +++++++++++---------- xen/common/wait.c | 4 ++-- xen/include/xen/sched.h | 8 ++++---- 9 files changed, 43 insertions(+), 41 deletions(-) diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index e5b3585..4279cad 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -3083,7 +3083,8 @@ static void nmi_mce_softirq(void) /* Set the tmp value unconditionally, so that * the check in the iret hypercall works. */ - cpumask_copy(st->vcpu->cpu_affinity_tmp, st->vcpu->cpu_affinity); + cpumask_copy(st->vcpu->cpu_hard_affinity_tmp, + st->vcpu->cpu_hard_affinity); if ((cpu != st->processor) || (st->processor != st->vcpu->processor)) @@ -3118,11 +3119,11 @@ void async_exception_cleanup(struct vcpu *curr) return; /* Restore affinity. 
*/ - if ( !cpumask_empty(curr->cpu_affinity_tmp) && - !cpumask_equal(curr->cpu_affinity_tmp, curr->cpu_affinity) ) + if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) && + !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) ) { - vcpu_set_affinity(curr, curr->cpu_affinity_tmp); - cpumask_clear(curr->cpu_affinity_tmp); + vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp); + cpumask_clear(curr->cpu_hard_affinity_tmp); } if ( !(curr->async_exception_mask & (curr->async_exception_mask - 1)) ) diff --git a/xen/common/domain.c b/xen/common/domain.c index c33b876..2916490 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -125,9 +125,9 @@ struct vcpu *alloc_vcpu( tasklet_init(&v->continue_hypercall_tasklet, NULL, 0); - if ( !zalloc_cpumask_var(&v->cpu_affinity) || - !zalloc_cpumask_var(&v->cpu_affinity_tmp) || - !zalloc_cpumask_var(&v->cpu_affinity_saved) || + if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) || + !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) || + !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) || !zalloc_cpumask_var(&v->cpu_soft_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -157,9 +157,9 @@ struct vcpu *alloc_vcpu( fail_wq: destroy_waitqueue_vcpu(v); fail_free: - free_cpumask_var(v->cpu_affinity); - free_cpumask_var(v->cpu_affinity_tmp); - free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->cpu_hard_affinity); + free_cpumask_var(v->cpu_hard_affinity_tmp); + free_cpumask_var(v->cpu_hard_affinity_saved); free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); @@ -373,7 +373,7 @@ void domain_update_node_affinity(struct domain *d) for_each_vcpu ( d, v ) { - cpumask_and(online_affinity, v->cpu_affinity, online); + cpumask_and(online_affinity, v->cpu_hard_affinity, online); cpumask_or(cpumask, cpumask, online_affinity); } @@ -736,9 +736,9 @@ static void complete_domain_destroy(struct rcu_head *head) for ( i = d->max_vcpus - 1; i >= 0; i-- ) if 
( (v = d->vcpu[i]) != NULL ) { - free_cpumask_var(v->cpu_affinity); - free_cpumask_var(v->cpu_affinity_tmp); - free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->cpu_hard_affinity); + free_cpumask_var(v->cpu_hard_affinity_tmp); + free_cpumask_var(v->cpu_hard_affinity_saved); free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); @@ -878,7 +878,7 @@ int vcpu_reset(struct vcpu *v) v->async_exception_mask = 0; memset(v->async_exception_state, 0, sizeof(v->async_exception_state)); #endif - cpumask_clear(v->cpu_affinity_tmp); + cpumask_clear(v->cpu_hard_affinity_tmp); clear_bit(_VPF_blocked, &v->pause_flags); clear_bit(_VPF_in_reset, &v->pause_flags); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 904d27b..5e0ac5c 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -629,7 +629,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) else { ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_affinity); + &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity); } } break; diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index 33c9a37..42fb418 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -296,7 +296,7 @@ static void dump_domains(unsigned char key) !vcpu_event_delivery_is_enabled(v)); cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask); printk("dirty_cpus=%s ", tmpstr); - cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); + cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_hard_affinity); printk("cpu_affinity=%s\n", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity); printk("cpu_soft_affinity=%s\n", tmpstr); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 28dafcf..398b095 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -332,13 +332,13 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) if ( step == CSCHED_BALANCE_NODE_AFFINITY ) { 
cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_affinity); + vc->cpu_hard_affinity); if ( unlikely(cpumask_empty(mask)) ) - cpumask_copy(mask, vc->cpu_affinity); + cpumask_copy(mask, vc->cpu_hard_affinity); } else /* step == CSCHED_BALANCE_CPU_AFFINITY */ - cpumask_copy(mask, vc->cpu_affinity); + cpumask_copy(mask, vc->cpu_hard_affinity); } static void burn_credits(struct csched_vcpu *svc, s_time_t now) @@ -407,7 +407,7 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new) if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY && !__vcpu_has_node_affinity(new->vcpu, - new->vcpu->cpu_affinity) ) + new->vcpu->cpu_hard_affinity) ) continue; /* Are there idlers suitable for new (for this balance step)? */ @@ -642,7 +642,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit) /* Store in cpus the mask of online cpus on which the domain can run */ online = cpupool_scheduler_cpumask(vc->domain->cpupool); - cpumask_and(&cpus, vc->cpu_affinity, online); + cpumask_and(&cpus, vc->cpu_hard_affinity, online); for_each_csched_balance_step( balance_step ) { @@ -1487,7 +1487,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) * or counter. 
*/ if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(vc, vc->cpu_affinity) ) + && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) ) continue; csched_balance_cpumask(vc, balance_step, csched_balance_mask); diff --git a/xen/common/sched_sedf.c b/xen/common/sched_sedf.c index 7c24171..c219aed 100644 --- a/xen/common/sched_sedf.c +++ b/xen/common/sched_sedf.c @@ -396,7 +396,7 @@ static int sedf_pick_cpu(const struct scheduler *ops, struct vcpu *v) cpumask_t *online; online = cpupool_scheduler_cpumask(v->domain->cpupool); - cpumask_and(&online_affinity, v->cpu_affinity, online); + cpumask_and(&online_affinity, v->cpu_hard_affinity, online); return cpumask_cycle(v->vcpu_id % cpumask_weight(&online_affinity) - 1, &online_affinity); } diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 5731622..28099d6 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -194,9 +194,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) */ v->processor = processor; if ( is_idle_domain(d) || d->is_pinned ) - cpumask_copy(v->cpu_affinity, cpumask_of(processor)); + cpumask_copy(v->cpu_hard_affinity, cpumask_of(processor)); else - cpumask_setall(v->cpu_affinity); + cpumask_setall(v->cpu_hard_affinity); cpumask_setall(v->cpu_soft_affinity); @@ -287,7 +287,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c) migrate_timer(&v->singleshot_timer, new_p); migrate_timer(&v->poll_timer, new_p); - cpumask_setall(v->cpu_affinity); + cpumask_setall(v->cpu_hard_affinity); lock = vcpu_schedule_lock_irq(v); v->processor = new_p; @@ -459,7 +459,7 @@ static void vcpu_migrate(struct vcpu *v) */ if ( pick_called && (new_lock == per_cpu(schedule_data, new_cpu).schedule_lock) && - cpumask_test_cpu(new_cpu, v->cpu_affinity) && + cpumask_test_cpu(new_cpu, v->cpu_hard_affinity) && cpumask_test_cpu(new_cpu, v->domain->cpupool->cpu_valid) ) break; @@ -563,7 +563,7 @@ void restore_vcpu_affinity(struct domain *d) { printk(XENLOG_DEBUG 
"Restoring affinity for d%dv%d\n", d->domain_id, v->vcpu_id); - cpumask_copy(v->cpu_affinity, v->cpu_affinity_saved); + cpumask_copy(v->cpu_hard_affinity, v->cpu_hard_affinity_saved); v->affinity_broken = 0; } @@ -606,20 +606,21 @@ int cpu_disable_scheduler(unsigned int cpu) unsigned long flags; spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags); - cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid); + cpumask_and(&online_affinity, v->cpu_hard_affinity, c->cpu_valid); if ( cpumask_empty(&online_affinity) && - cpumask_test_cpu(cpu, v->cpu_affinity) ) + cpumask_test_cpu(cpu, v->cpu_hard_affinity) ) { printk(XENLOG_DEBUG "Breaking affinity for d%dv%d\n", d->domain_id, v->vcpu_id); if (system_state == SYS_STATE_suspend) { - cpumask_copy(v->cpu_affinity_saved, v->cpu_affinity); + cpumask_copy(v->cpu_hard_affinity_saved, + v->cpu_hard_affinity); v->affinity_broken = 1; } - cpumask_setall(v->cpu_affinity); + cpumask_setall(v->cpu_hard_affinity); } if ( v->processor == cpu ) @@ -667,7 +668,7 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) lock = vcpu_schedule_lock_irq(v); - cpumask_copy(v->cpu_affinity, affinity); + cpumask_copy(v->cpu_hard_affinity, affinity); /* Always ask the scheduler to re-evaluate placement * when changing the affinity */ diff --git a/xen/common/wait.c b/xen/common/wait.c index 3c9366c..3f6ff41 100644 --- a/xen/common/wait.c +++ b/xen/common/wait.c @@ -134,7 +134,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) /* Save current VCPU affinity; force wakeup on *this* CPU only. */ wqv->wakeup_cpu = smp_processor_id(); - cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity); + cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); @@ -183,7 +183,7 @@ void check_wakeup_from_wait(void) { /* Re-set VCPU affinity and re-enter the scheduler. 
*/ struct vcpu *curr = current; - cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity); + cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 7e00caf..3575312 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -192,11 +192,11 @@ struct vcpu spinlock_t virq_lock; /* Bitmask of CPUs on which this VCPU may run. */ - cpumask_var_t cpu_affinity; + cpumask_var_t cpu_hard_affinity; /* Used to change affinity temporarily. */ - cpumask_var_t cpu_affinity_tmp; + cpumask_var_t cpu_hard_affinity_tmp; /* Used to restore affinity across S3. */ - cpumask_var_t cpu_affinity_saved; + cpumask_var_t cpu_hard_affinity_saved; /* Bitmask of CPUs on which this VCPU prefers to run. */ cpumask_var_t cpu_soft_affinity; @@ -795,7 +795,7 @@ void watchdog_domain_destroy(struct domain *d); #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv) #define has_hvm_container_vcpu(v) (has_hvm_container_domain((v)->domain)) #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \ - cpumask_weight((v)->cpu_affinity) == 1) + cpumask_weight((v)->cpu_hard_affinity) == 1) #ifdef HAS_PASSTHROUGH #define need_iommu(d) ((d)->need_iommu) #else
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 08/16] xen: derive NUMA node affinity from hard and soft CPU affinity
if a domain's NUMA node-affinity (which is what controls memory
allocations) is provided by the user/toolstack, it is simply left
untouched. However, if the user does not say anything, leaving it
all to Xen, let's compute it in the following way:

 1. cpupool's cpus & hard-affinity & soft-affinity
 2. if (1) is empty: cpupool's cpus & hard-affinity

This guarantees that memory is allocated from the narrowest possible
set of NUMA nodes, and makes it relatively easy to set up NUMA-aware
scheduling on top of soft affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 xen/common/domain.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 2916490..4b8fca8 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -353,7 +353,7 @@ struct domain *domain_create(
 
 void domain_update_node_affinity(struct domain *d)
 {
-    cpumask_var_t cpumask;
+    cpumask_var_t cpumask, cpumask_soft;
     cpumask_var_t online_affinity;
     const cpumask_t *online;
     struct vcpu *v;
@@ -361,9 +361,15 @@ void domain_update_node_affinity(struct domain *d)
 
     if ( !zalloc_cpumask_var(&cpumask) )
         return;
+    if ( !zalloc_cpumask_var(&cpumask_soft) )
+    {
+        free_cpumask_var(cpumask);
+        return;
+    }
     if ( !alloc_cpumask_var(&online_affinity) )
     {
         free_cpumask_var(cpumask);
+        free_cpumask_var(cpumask_soft);
         return;
     }
 
@@ -373,8 +379,12 @@ void domain_update_node_affinity(struct domain *d)
 
     for_each_vcpu ( d, v )
     {
+        /* Build up the mask of online pcpus we have hard affinity with */
         cpumask_and(online_affinity, v->cpu_hard_affinity, online);
         cpumask_or(cpumask, cpumask, online_affinity);
+
+        /* As well as the mask of all pcpus we have soft affinity with */
+        cpumask_or(cpumask_soft, cpumask_soft, v->cpu_soft_affinity);
     }
 
     /*
@@ -386,6 +396,15 @@ void domain_update_node_affinity(struct domain *d)
      */
     if ( d->auto_node_affinity )
     {
+        /*
+         * We're looking for the narrowest possible set of nodes. So, if
+         * possible (i.e., if not empty!) let's use the intersection
+         * between online, hard and soft affinity. If not, just fall back
+         * to online & hard affinity.
+         */
+        if ( cpumask_intersects(cpumask, cpumask_soft) )
+            cpumask_and(cpumask, cpumask, cpumask_soft);
+
         nodes_clear(d->node_affinity);
         for_each_online_node ( node )
             if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
@@ -397,6 +416,7 @@ void domain_update_node_affinity(struct domain *d)
 
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
+    free_cpumask_var(cpumask_soft);
     free_cpumask_var(cpumask);
 }
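The two-step rule from the changelog above can be modelled outside of Xen in a few lines. What follows is only a sketch: the toy_* types and toy_auto_node_affinity() are invented for illustration, masks are plain integers rather than cpumask_t/nodemask_t, and locking and the !auto_node_affinity case are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_cpumask_t;   /* bit i = pcpu i */
typedef uint32_t toy_nodemask_t;  /* bit n = NUMA node n */

#define TOY_MAX_NODES 4

/*
 * hard[] and soft[] hold the per-vcpu affinities, node_cpus[] the pcpus
 * of each node, online the pcpus of the domain's cpupool.
 */
static toy_nodemask_t toy_auto_node_affinity(
    const toy_cpumask_t *hard, const toy_cpumask_t *soft,
    unsigned int nr_vcpus, toy_cpumask_t online,
    const toy_cpumask_t node_cpus[TOY_MAX_NODES])
{
    toy_cpumask_t cpumask = 0, cpumask_soft = 0;
    toy_nodemask_t nodes = 0;
    unsigned int i, node;

    assert(hard != NULL && soft != NULL && node_cpus != NULL);

    /* Union, over all vcpus, of (hard & online) and of soft. */
    for (i = 0; i < nr_vcpus; i++) {
        cpumask |= hard[i] & online;
        cpumask_soft |= soft[i];
    }

    /* Step 1 if the intersection is not empty, step 2 otherwise. */
    if (cpumask & cpumask_soft)
        cpumask &= cpumask_soft;

    /* A node is in the node-affinity if it owns any of those pcpus. */
    for (node = 0; node < TOY_MAX_NODES; node++)
        if (node_cpus[node] & cpumask)
            nodes |= 1u << node;

    return nodes;
}
```

On a 16 pcpus, 2 nodes host (pcpus 0-7 on node 0, 8-15 on node 1), vcpus with hard affinity "all" and soft affinity 8-15 end up with node-affinity {1}. If they are instead pinned to pcpus 0-3 while soft affinity stays 8-15, the intersection is empty and the fallback gives {0}.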
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 09/16] xen: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
by adding a flag for the caller to specify which one they care about.

Also add another cpumap there. This way, in case of DOMCTL_setvcpuaffinity, Xen can report back to the caller the "effective affinity" of the vcpu. We call effective affinity the intersection between the cpupool's cpus, the new hard affinity and (if asking to set soft affinity) the new soft affinity. The purpose of this is to allow the toolstack to figure out whether or not the requested change in affinity produced sensible results, when combined with the other settings that are already in place.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxc/xc_domain.c | 4 +++-
 xen/arch/x86/traps.c | 4 ++--
 xen/common/domctl.c | 38 ++++++++++++++++++++++++++++++++++----
 xen/common/schedule.c | 35 ++++++++++++++++++++++++-----------
 xen/common/wait.c | 6 +++---
 xen/include/public/domctl.h | 15 +++++++++++++--
 xen/include/xen/sched.h | 3 ++-
 7 files changed, 81 insertions(+), 24 deletions(-)

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 1ccafc5..f9ae4bf 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -215,7 +215,9 @@ int xc_vcpu_setaffinity(xc_interface *xch, domctl.cmd = XEN_DOMCTL_setvcpuaffinity; domctl.domain = (domid_t)domid; - domctl.u.vcpuaffinity.vcpu = vcpu; + domctl.u.vcpuaffinity.vcpu = vcpu; + /* Soft affinity is there, but not used anywhere for now, so... */ + domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD; memcpy(local, cpumap, cpusize); diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c index 4279cad..196ff68 100644 --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -3093,7 +3093,7 @@ static void nmi_mce_softirq(void) * Make sure to wakeup the vcpu on the * specified processor. */ - vcpu_set_affinity(st->vcpu, cpumask_of(st->processor)); + vcpu_set_hard_affinity(st->vcpu, cpumask_of(st->processor)); /* Affinity is restored in the iret hypercall.
*/ } @@ -3122,7 +3122,7 @@ void async_exception_cleanup(struct vcpu *curr) if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) && !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) ) { - vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp); + vcpu_set_hard_affinity(curr, curr->cpu_hard_affinity_tmp); cpumask_clear(curr->cpu_hard_affinity_tmp); } diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 5e0ac5c..b3f779e 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -617,19 +617,49 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) if ( op->cmd == XEN_DOMCTL_setvcpuaffinity ) { cpumask_var_t new_affinity; + cpumask_t *online; ret = xenctl_bitmap_to_cpumask( &new_affinity, &op->u.vcpuaffinity.cpumap); - if ( !ret ) + if ( ret ) + break; + + ret = -EINVAL; + if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD ) + ret = vcpu_set_hard_affinity(v, new_affinity); + else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT ) + ret = vcpu_set_soft_affinity(v, new_affinity); + + if ( ret ) { - ret = vcpu_set_affinity(v, new_affinity); free_cpumask_var(new_affinity); + break; } + + /* + * Report back to the caller what the "effective affinity", that + * is the intersection of cpupool''s pcpus, the (new?) hard + * affinity and the (new?) soft-affinity. 
+ */ + online = cpupool_online_cpumask(v->domain->cpupool); + cpumask_and(new_affinity, online, v->cpu_hard_affinity); + cpumask_and(new_affinity, new_affinity , v->cpu_soft_affinity); + + ret = cpumask_to_xenctl_bitmap( + &op->u.vcpuaffinity.effective_affinity, new_affinity); + + free_cpumask_var(new_affinity); } else { - ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity); + if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD ) + ret = cpumask_to_xenctl_bitmap( + &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity); + else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT ) + ret = cpumask_to_xenctl_bitmap( + &op->u.vcpuaffinity.cpumap, v->cpu_soft_affinity); + else + ret = -EINVAL; } } break; diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 28099d6..bb52366 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -653,22 +653,14 @@ void sched_set_node_affinity(struct domain *d, nodemask_t *mask) SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); } -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) +static int vcpu_set_affinity( + struct vcpu *v, const cpumask_t *affinity, cpumask_t **which) { - cpumask_t online_affinity; - cpumask_t *online; spinlock_t *lock; - if ( v->domain->is_pinned ) - return -EINVAL; - online = VCPU2ONLINE(v); - cpumask_and(&online_affinity, affinity, online); - if ( cpumask_empty(&online_affinity) ) - return -EINVAL; - lock = vcpu_schedule_lock_irq(v); - cpumask_copy(v->cpu_hard_affinity, affinity); + cpumask_copy(*which, affinity); /* Always ask the scheduler to re-evaluate placement * when changing the affinity */ @@ -687,6 +679,27 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) return 0; } +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity) +{ + cpumask_t online_affinity; + cpumask_t *online; + + if ( v->domain->is_pinned ) + return -EINVAL; + + online = VCPU2ONLINE(v); + cpumask_and(&online_affinity, affinity, online); + if 
( cpumask_empty(&online_affinity) ) + return -EINVAL; + + return vcpu_set_affinity(v, affinity, &v->cpu_hard_affinity); +} + +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity) +{ + return vcpu_set_affinity(v, affinity, &v->cpu_soft_affinity); +} + /* Block the currently-executing domain until a pertinent event occurs. */ void vcpu_block(void) { diff --git a/xen/common/wait.c b/xen/common/wait.c index 3f6ff41..1f6b597 100644 --- a/xen/common/wait.c +++ b/xen/common/wait.c @@ -135,7 +135,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) /* Save current VCPU affinity; force wakeup on *this* CPU only. */ wqv->wakeup_cpu = smp_processor_id(); cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); domain_crash_synchronous(); @@ -166,7 +166,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) static void __finish_wait(struct waitqueue_vcpu *wqv) { wqv->esp = NULL; - (void)vcpu_set_affinity(current, &wqv->saved_affinity); + (void)vcpu_set_hard_affinity(current, &wqv->saved_affinity); } void check_wakeup_from_wait(void) @@ -184,7 +184,7 @@ void check_wakeup_from_wait(void) /* Re-set VCPU affinity and re-enter the scheduler. 
*/ struct vcpu *curr = current; cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); domain_crash_synchronous(); diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 01a3652..aed9cd4 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -300,8 +300,19 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ struct xen_domctl_vcpuaffinity { - uint32_t vcpu; /* IN */ - struct xenctl_bitmap cpumap; /* IN/OUT */ + /* IN variables. */ + uint32_t vcpu; + /* Set/get the hard affinity for vcpu */ +#define _XEN_VCPUAFFINITY_HARD 0 +#define XEN_VCPUAFFINITY_HARD (1U<<_XEN_VCPUAFFINITY_HARD) + /* Set/get the soft affinity for vcpu */ +#define _XEN_VCPUAFFINITY_SOFT 1 +#define XEN_VCPUAFFINITY_SOFT (1U<<_XEN_VCPUAFFINITY_SOFT) + uint32_t flags; + /* IN/OUT variables. */ + struct xenctl_bitmap cpumap; + /* OUT variables. */ + struct xenctl_bitmap effective_affinity; }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 3575312..0f728b3 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -755,7 +755,8 @@ void scheduler_free(struct scheduler *sched); int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); void vcpu_force_reschedule(struct vcpu *v); int cpu_disable_scheduler(unsigned int cpu); -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity); void restore_vcpu_affinity(struct domain *d); void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
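As a rough illustration of what the toolstack gets back from the set operation, here is a self-contained model of the flag dispatch and of the effective affinity computation. It is a sketch, not the hypervisor code: struct toy_vcpu, toy_set_affinity() and the TOY_* flags are made up (the flag values just mirror XEN_VCPUAFFINITY_HARD and XEN_VCPUAFFINITY_SOFT from the patch), masks are plain integers, and the d->is_pinned check and all locking are omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t toy_cpumask_t;   /* bit i = pcpu i */

/* Same bit positions as XEN_VCPUAFFINITY_HARD/_SOFT in the patch. */
#define TOY_VCPUAFFINITY_HARD (1U << 0)
#define TOY_VCPUAFFINITY_SOFT (1U << 1)

/* Hypothetical, stripped down vcpu: just the two affinity masks. */
struct toy_vcpu {
    toy_cpumask_t hard;
    toy_cpumask_t soft;
};

/*
 * Set the hard or the soft affinity, depending on flags, and report
 * the effective affinity (online & hard & soft) back to the caller.
 */
static int toy_set_affinity(struct toy_vcpu *v, uint32_t flags,
                            toy_cpumask_t new_mask, toy_cpumask_t online,
                            toy_cpumask_t *effective)
{
    assert(v != NULL && effective != NULL);

    if (flags & TOY_VCPUAFFINITY_HARD) {
        /* A hard affinity with no online pcpu in it is rejected. */
        if ((new_mask & online) == 0)
            return -EINVAL;
        v->hard = new_mask;
    } else if (flags & TOY_VCPUAFFINITY_SOFT) {
        /* Any soft affinity is accepted: it is only a preference. */
        v->soft = new_mask;
    } else {
        return -EINVAL;
    }

    *effective = online & v->hard & v->soft;
    return 0;
}
```

An empty effective affinity after a successful call is the interesting case for the toolstack: the request was accepted, but hard and soft affinity do not overlap within the cpupool, so the soft affinity will not have any effect.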
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 10/16] xen: sched: use soft-affinity instead of domain's node-affinity
now that we have it, use soft affinity for scheduling, and replace the indirect use of the domain's NUMA node-affinity. This is more general, as soft affinity does not have to be related to NUMA. At the same time, it allows achieving the same results as NUMA-aware scheduling, just by making soft affinity equal to the domain's node affinity, for all the vCPUs (e.g., from the toolstack).

This also means renaming most of the NUMA-aware scheduling related functions, in credit1, to something more generic, hinting toward the concept of soft affinity rather than directly to NUMA awareness.

As a side effect, this simplifies the code quite a bit. In fact, prior to this change, we needed to cache the translation of d->node_affinity (which is a nodemask_t) to a cpumask_t, since that is what scheduling decisions require (we used to keep it in node_affinity_cpumask). This, and all the complicated logic required to keep it updated, is no longer necessary.

The high level description of NUMA placement and scheduling in docs/misc/xl-numa-placement.markdown is updated too, to match the new architecture.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v1:
 * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for actual scheduling" was doing something very similar to this patch.
---
 docs/misc/xl-numa-placement.markdown | 148 ++++++++++++++++++++++-----------
 xen/common/domain.c | 2
 xen/common/sched_credit.c | 153 +++++++++++++---------------------
 3 files changed, 157 insertions(+), 146 deletions(-)

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown index caa3fec..b1ed361 100644 --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually defined as a set of processor cores (typically a physical CPU package) and the memory directly attached to the set of cores.
-The Xen hypervisor deals with NUMA machines by assigning to each domain -a "node affinity", i.e., a set of NUMA nodes of the host from which they -get their memory allocated. Also, even if the node affinity of a domain -is allowed to change on-line, it is very important to "place" the domain -correctly when it is fist created, as the most of its memory is allocated -at that time and can not (for now) be moved easily. - NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the @@ -27,14 +20,37 @@ performance degradation is likely to be noticeable. For more information, have a look at the [Xen NUMA Introduction][numa_intro] page on the Wiki. +## Xen and NUMA machines: the concept of _node-affinity_ ## + +The Xen hypervisor deals with NUMA machines throughout the concept of +_node-affinity_. The node-affinity of a domain is the set of NUMA nodes +of the host where the memory for the domain is being allocated (mostly, +at domain creation time). This is, at least in principle, different and +unrelated with the vCPU (hard and soft, see below) scheduling affinity, +which instead is the set of pCPUs where the vCPU is allowed (or prefers) +to run. + +Of course, despite the fact that they belong to and affect different +subsystems, the domain node-affinity and the vCPUs affinity are not +completely independent. +In fact, if the domain node-affinity is not explicitly specified by the +user, via the proper libxl calls or xl config item, it will be computed +basing on the vCPUs'' scheduling affinity. + +Notice that, even if the node affinity of a domain may change on-line, +it is very important to "place" the domain correctly when it is fist +created, as the most of its memory is allocated at that time and can +not (for now) be moved easily. 
+ ### Placing via pinning and cpupools ### -The simplest way of placing a domain on a NUMA node is statically pinning -the domain''s vCPUs to the pCPUs of the node. This goes under the name of -CPU affinity and can be set through the "cpus=" option in the config file -(more about this below). Another option is to pool together the pCPUs -spanning the node and put the domain in such a cpupool with the "pool=" -config option (as documented in our [Wiki][cpupools_howto]). +The simplest way of placing a domain on a NUMA node is setting the hard +scheduling affinity of the domain''s vCPUs to the pCPUs of the node. This +also goes under the name of vCPU pinning, and can be done through the +"cpus=" option in the config file (more about this below). Another option +is to pool together the pCPUs spanning the node and put the domain in +such a _cpupool_ with the "pool=" config option (as documented in our +[Wiki][cpupools_howto]). In both the above cases, the domain will not be able to execute outside the specified set of pCPUs for any reasons, even if all those pCPUs are @@ -45,24 +61,45 @@ may come at he cost of some load imbalances. ### NUMA aware scheduling ### -If the credit scheduler is in use, the concept of node affinity defined -above does not only apply to memory. In fact, starting from Xen 4.3, the -scheduler always tries to run the domain''s vCPUs on one of the nodes in -its node affinity. Only if that turns out to be impossible, it will just -pick any free pCPU. - -This is, therefore, something more flexible than CPU affinity, as a domain -can still run everywhere, it just prefers some nodes rather than others. -Locality of access is less guaranteed than in the pinning case, but that -comes along with better chances to exploit all the host resources (e.g., -the pCPUs). 
- -In fact, if all the pCPUs in a domain''s node affinity are busy, it is -possible for the domain to run outside of there, but it is very likely that -slower execution (due to remote memory accesses) is still better than no -execution at all, as it would happen with pinning. For this reason, NUMA -aware scheduling has the potential of bringing substantial performances -benefits, although this will depend on the workload. +If using the credit1 scheduler, and starting from Xen 4.3, the scheduler +itself always tries to run the domain''s vCPUs on one of the nodes in +its node-affinity. Only if that turns out to be impossible, it will just +pick any free pCPU. Locality of access is less guaranteed than in the +pinning case, but that comes along with better chances to exploit all +the host resources (e.g., the pCPUs). + +Starting from Xen 4.4, credit1 supports two forms of affinity: hard and +soft, both on a per-vCPU basis. This means each vCPU can have its own +soft affinity, stating where such vCPU prefers to execute on. This is +less strict than what it (also starting from 4.4) is called hard affinity, +as the vCPU can potentially run everywhere, it just prefers some pCPUs +rather than others. +In Xen 4.4, therefore, NUMA-aware scheduling is achieved by matching the +soft affinity of the vCPUs of a domain with its node-affinity. + +In fact, as it was for 4.3, if all the pCPUs in a vCPU''s soft affinity +are busy, it is possible for the domain to run outside from there. The +idea is that slower execution (due to remote memory accesses) is still +better than no execution at all (as it would happen with pinning). For +this reason, NUMA aware scheduling has the potential of bringing +substantial performances benefits, although this will depend on the +workload. 
+ +Notice that, for each vCPU, the following three scenarios are possible: + + * a vCPU *is pinned* to some pCPUs and *does not have* any soft affinity. + In this case, the vCPU is always scheduled on one of the pCPUs to which + it is pinned, without any specific preference among them. + * a vCPU *has* its own soft affinity and *is not* pinned to any particular + pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the + scheduler will try to have it running on one of the pCPUs in its soft + affinity; + * a vCPU *has* its own soft affinity and *is also* pinned to some + pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs + onto which it is pinned, with, among them, a preference for the ones + that also form its soft affinity. In case pinning and soft affinity + form two disjoint sets of pCPUs, pinning "wins", and the soft affinity + is just ignored. ## Guest placement in xl ## @@ -71,25 +108,23 @@ both manual or automatic placement of them across the host's NUMA nodes. Note that xm/xend does a very similar thing, the only differences being the details of the heuristics adopted for automatic placement (see below), -and the lack of support (in both xm/xend and the Xen versions where that\ +and the lack of support (in both xm/xend and the Xen versions where that was the default toolstack) for NUMA aware scheduling. ### Placing the guest manually ### Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled on, directly in its config file. This -affects NUMA placement and memory accesses as the hypervisor constructs -the node affinity of a VM basing right on its CPU affinity when it is -created. +affects NUMA placement and memory accesses as, in this case, the +hypervisor constructs the node-affinity of a VM based on its +vCPU pinning when it is created.
This is very simple and effective, but requires the user/system -administrator to explicitly specify affinities for each and every domain, +administrator to explicitly specify the pinning for each and every domain, or Xen won''t be able to guarantee the locality for their memory accesses. -Notice that this also pins the domain''s vCPUs to the specified set of -pCPUs, so it not only sets the domain''s node affinity (its memory will -come from the nodes to which the pCPUs belong), but at the same time -forces the vCPUs of the domain to be scheduled on those same pCPUs. +That, of course, also mean the vCPUs of the domain will only be able to +execute on those same pCPUs. ### Placing the guest automatically ### @@ -97,7 +132,9 @@ If no "cpus=" option is specified in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. If it finds one (some), the domain''s node affinity get set to there, and both memory allocations and NUMA aware scheduling (for the credit -scheduler and starting from Xen 4.3) will comply with it. +scheduler and starting from Xen 4.3) will comply with it. Starting from +Xen 4.4, this also means that the mask resulting from this "fitting" +procedure will become the soft affinity of all the vCPUs of the domain. It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an host is an incarnation of the Bin Packing Problem. In fact, @@ -142,34 +179,43 @@ any placement from happening: libxl_defbool_set(&domain_build_info->numa_placement, false); -Also, if `numa_placement` is set to `true`, the domain must not -have any CPU affinity (i.e., `domain_build_info->cpumap` must -have all its bits set, as it is by default), or domain creation -will fail returning `ERROR_INVAL`. +Also, if `numa_placement` is set to `true`, the domain''s vCPUs must +not be pinned (i.e., `domain_build_info->cpumap` must have all its +bits set, as it is by default), or domain creation will fail with +`ERROR_INVAL`. 
Starting from Xen 4.3, in case automatic placement happens (and is -successful), it will affect the domain''s node affinity and _not_ its -CPU affinity. Namely, the domain''s vCPUs will not be pinned to any +successful), it will affect the domain''s node-affinity and _not_ its +vCPU pinning. Namely, the domain''s vCPUs will not be pinned to any pCPU on the host, but the memory from the domain will come from the selected node(s) and the NUMA aware scheduling (if the credit scheduler -is in use) will try to keep the domain there as much as possible. +is in use) will try to keep the domain''s vCPUs there as much as possible. Besides than that, looking and/or tweaking the placement algorithm search "Automatic NUMA placement" in libxl\_internal.h. Note this may change in future versions of Xen/libxl. +## Xen < 4.4 ## + +The concept of vCPU soft affinity has been introduced for the first time +in Xen 4.4. In 4.3, it is the domain''s node-affinity that drives the +NUMA-aware scheduler. The main difference is soft affinity is per-vCPU, +and so each vCPU can have its own mask of pCPUs, while node-affinity is +per-domain, that is the equivalent of having all the vCPUs with the same +soft affinity. + ## Xen < 4.3 ## As NUMA aware scheduling is a new feature of Xen 4.3, things are a little bit different for earlier version of Xen. If no "cpus=" option is specified and Xen 4.2 is in use, the automatic placement algorithm still runs, but the results is used to _pin_ the vCPUs of the domain to the output node(s). -This is consistent with what was happening with xm/xend, which were also -affecting the domain''s CPU affinity. +This is consistent with what was happening with xm/xend. On a version of Xen earlier than 4.2, there is not automatic placement at -all in xl or libxl, and hence no node or CPU affinity being affected. +all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning +being introduced/modified. 
## Limitations ## diff --git a/xen/common/domain.c b/xen/common/domain.c index 4b8fca8..b599223 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -411,8 +411,6 @@ void domain_update_node_affinity(struct domain *d) node_set(node, d->node_affinity); } - sched_set_node_affinity(d, &d->node_affinity); - spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 398b095..0790ebb 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -112,10 +112,24 @@ /* - * Node Balancing + * Hard and soft affinity load balancing. + * + * Idea is each vcpu has some pcpus that it prefers, some that it does not + * prefer but is OK with, and some that it cannot run on at all. The first + * set of pcpus are the ones that are both in the soft affinity *and* in the + * hard affinity; the second set of pcpus are the ones that are in the hard + * affinity but *not* in the soft affinity; the third set of pcpus are the + * ones that are not in the hard affinity. + * + * We implement a two step balancing logic. Basically, every time there is + * the need to decide where to run a vcpu, we first check the soft affinity + * (well, actually, the && between soft and hard affinity), to see if we can + * send it where it prefers to (and can) run on. However, if the first step + * does not find any suitable and free pcpu, we fall back checking the hard + * affinity. */ -#define CSCHED_BALANCE_NODE_AFFINITY 0 -#define CSCHED_BALANCE_CPU_AFFINITY 1 +#define CSCHED_BALANCE_SOFT_AFFINITY 0 +#define CSCHED_BALANCE_HARD_AFFINITY 1 /* * Boot parameters @@ -138,7 +152,7 @@ struct csched_pcpu { /* * Convenience macro for accessing the per-PCPU cpumask we need for - * implementing the two steps (vcpu and node affinity) balancing logic. + * implementing the two steps (soft and hard affinity) balancing logic. 
* It is stored in csched_pcpu so that serialization is not an issue, * as there is a csched_pcpu for each PCPU and we always hold the * runqueue spin-lock when using this. @@ -178,9 +192,6 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; - /* cpumask translated from the domain''s node-affinity. - * Basically, the CPUs we prefer to be scheduled on. */ - cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -261,59 +272,28 @@ __runq_remove(struct csched_vcpu *svc) list_del_init(&svc->runq_elem); } -/* - * Translates node-affinity mask into a cpumask, so that we can use it during - * actual scheduling. That of course will contain all the cpus from all the - * set nodes in the original node-affinity mask. - * - * Note that any serialization needed to access mask safely is complete - * responsibility of the caller of this function/hook. - */ -static void csched_set_node_affinity( - const struct scheduler *ops, - struct domain *d, - nodemask_t *mask) -{ - struct csched_dom *sdom; - int node; - - /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ - if ( unlikely(is_idle_domain(d)) ) - return; - - sdom = CSCHED_DOM(d); - cpumask_clear(sdom->node_affinity_cpumask); - for_each_node_mask( node, *mask ) - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, - &node_to_cpumask(node)); -} #define for_each_csched_balance_step(step) \ - for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ ) + for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ ) /* - * vcpu-affinity balancing is always necessary and must never be skipped. - * OTOH, if a domain''s node-affinity is said to be automatically computed - * (or if it just spans all the nodes), we can safely avoid dealing with - * node-affinity entirely. + * Hard affinity balancing is always necessary and must never be skipped. 
+ * OTOH, if the vcpu''s soft affinity is full (it spans all the possible + * pcpus) we can safely avoid dealing with it entirely. * - * Node-affinity is also deemed meaningless in case it has empty - * intersection with mask, to cover the cases where using the node-affinity + * A vcpu''s soft affinity is also deemed meaningless in case it has empty + * intersection with mask, to cover the cases where using the soft affinity * mask seems legit, but would instead led to trying to schedule the vcpu * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu''s - * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus + * hard affinity, or to the && of hard affinity and the set of online cpus * in the domain''s cpupool. */ -static inline int __vcpu_has_node_affinity(const struct vcpu *vc, +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc, const cpumask_t *mask) { - const struct domain *d = vc->domain; - const struct csched_dom *sdom = CSCHED_DOM(d); - - if ( d->auto_node_affinity - || cpumask_full(sdom->node_affinity_cpumask) - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) + if ( cpumask_full(vc->cpu_soft_affinity) + || !cpumask_intersects(vc->cpu_soft_affinity, mask) ) return 0; return 1; @@ -321,23 +301,22 @@ static inline int __vcpu_has_node_affinity(const struct vcpu *vc, /* * Each csched-balance step uses its own cpumask. This function determines - * which one (given the step) and copies it in mask. For the node-affinity - * balancing step, the pcpus that are not part of vc''s vcpu-affinity are + * which one (given the step) and copies it in mask. For the soft affinity + * balancing step, the pcpus that are not part of vc''s hard affinity are * filtered out from the result, to avoid running a vcpu where it would * like, but is not allowed to! 
*/ static void csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) { - if ( step == CSCHED_BALANCE_NODE_AFFINITY ) + if ( step == CSCHED_BALANCE_SOFT_AFFINITY ) { - cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_hard_affinity); + cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity); if ( unlikely(cpumask_empty(mask)) ) cpumask_copy(mask, vc->cpu_hard_affinity); } - else /* step == CSCHED_BALANCE_CPU_AFFINITY */ + else /* step == CSCHED_BALANCE_HARD_AFFINITY */ cpumask_copy(mask, vc->cpu_hard_affinity); } @@ -398,15 +377,15 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new) else if ( !idlers_empty ) { /* - * Node and vcpu-affinity balancing loop. For vcpus without - * a useful node-affinity, consider vcpu-affinity only. + * Soft and hard affinity balancing loop. For vcpus without + * a useful soft affinity, consider hard affinity only. */ for_each_csched_balance_step( balance_step ) { int new_idlers_empty; - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(new->vcpu, + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(new->vcpu, new->vcpu->cpu_hard_affinity) ) continue; @@ -418,11 +397,11 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new) /* * Let''s not be too harsh! If there aren''t idlers suitable - * for new in its node-affinity mask, make sure we check its - * vcpu-affinity as well, before taking final decisions. + * for new in its soft affinity mask, make sure we check its + * hard affinity as well, before taking final decisions. */ if ( new_idlers_empty - && balance_step == CSCHED_BALANCE_NODE_AFFINITY ) + && balance_step == CSCHED_BALANCE_SOFT_AFFINITY ) continue; /* @@ -649,23 +628,23 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit) /* * We want to pick up a pcpu among the ones that are online and * can accommodate vc, which is basically what we computed above - * and stored in cpus. 
As far as vcpu-affinity is concerned, + * and stored in cpus. As far as hard affinity is concerned, * there always will be at least one of these pcpus, hence cpus * is never empty and the calls to cpumask_cycle() and * cpumask_test_cpu() below are ok. * - * On the other hand, when considering node-affinity too, it + * On the other hand, when considering soft affinity too, it * is possible for the mask to become empty (for instance, if the * domain has been put in a cpupool that does not contain any of the - * nodes in its node-affinity), which would result in the ASSERT()-s + * pcpus in its soft affinity), which would result in the ASSERT()-s * inside cpumask_*() operations triggering (in debug builds). * - * Therefore, in this case, we filter the node-affinity mask against - * cpus and, if the result is empty, we just skip the node-affinity + * Therefore, in this case, we filter the soft affinity mask against + * cpus and, if the result is empty, we just skip the soft affinity * balancing step all together. 
*/ - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(vc, &cpus) ) + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, &cpus) ) continue; /* Pick an online CPU from the proper affinity mask */ @@ -1111,13 +1090,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) - { - xfree(sdom); - return NULL; - } - cpumask_setall(sdom->node_affinity_cpumask); - /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); INIT_LIST_HEAD(&sdom->active_sdom_elem); @@ -1147,9 +1119,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom) static void csched_free_domdata(const struct scheduler *ops, void *data) { - struct csched_dom *sdom = data; - - free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1475,19 +1444,19 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) BUG_ON( is_idle_vcpu(vc) ); /* - * If the vcpu has no useful node-affinity, skip this vcpu. - * In fact, what we want is to check if we have any node-affine - * work to steal, before starting to look at vcpu-affine work. + * If the vcpu has no useful soft affinity, skip this vcpu. + * In fact, what we want is to check if we have any "soft-affine + * work" to steal, before starting to look at "hard-affine work". * * Notice that, if not even one vCPU on this runq has a useful - * node-affinity, we could have avoid considering this runq for - * a node balancing step in the first place. This, for instance, + * soft affinity, we could have avoid considering this runq for + * a soft balancing step in the first place. This, for instance, * can be implemented by taking note of on what runq there are - * vCPUs with useful node-affinities in some sort of bitmap + * vCPUs with useful soft affinities in some sort of bitmap * or counter. 
*/ - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) ) + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) ) continue; csched_balance_cpumask(vc, balance_step, csched_balance_mask); @@ -1535,17 +1504,17 @@ csched_load_balance(struct csched_private *prv, int cpu, SCHED_STAT_CRANK(load_balance_other); /* - * Let''s look around for work to steal, taking both vcpu-affinity - * and node-affinity into account. More specifically, we check all + * Let''s look around for work to steal, taking both hard affinity + * and soft affinity into account. More specifically, we check all * the non-idle CPUs'' runq, looking for: - * 1. any node-affine work to steal first, - * 2. if not finding anything, any vcpu-affine work to steal. + * 1. any "soft-affine work" to steal first, + * 2. if not finding anything, any "hard-affine work" to steal. */ for_each_csched_balance_step( bstep ) { /* * We peek at the non-idling CPUs in a node-wise fashion. In fact, - * it is more likely that we find some node-affine work on our same + * it is more likely that we find some affine work on our same * node, not to mention that migrating vcpus within the same node * could well expected to be cheaper than across-nodes (memory * stays local, there might be some node-wide cache[s], etc.). @@ -1976,8 +1945,6 @@ const struct scheduler sched_credit_def = { .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, - .set_node_affinity = csched_set_node_affinity, - .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule,
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 11/16] libxc: get and set soft and hard affinity
by using the new flag introduced in the parameters of DOMCTL_{get,set}_vcpuaffinity.

This happens in two new xc calls: xc_vcpu_setaffinity_hard() and xc_vcpu_setaffinity_soft() (and in the corresponding getters, of course).

The existing xc_vcpu_{set,get}affinity() calls are also retained, with the following behavior:

 * xc_vcpu_setaffinity() sets both the hard and soft affinity;
 * xc_vcpu_getaffinity() gets the hard affinity.

This is mainly for backward compatibility reasons, i.e., trying not to break existing callers/users.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxc/xc_domain.c | 153 ++++++++++++++++++++++++++++++++++++++++++-----
 tools/libxc/xenctrl.h | 53 ++++++++++++++++
 2 files changed, 190 insertions(+), 16 deletions(-)

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index f9ae4bf..30bfe7b 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -189,13 +189,16 @@ int xc_domain_node_getaffinity(xc_interface *xch, return ret; } -int xc_vcpu_setaffinity(xc_interface *xch, - uint32_t domid, - int vcpu, - xc_cpumap_t cpumap) +static int _vcpu_setaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + uint32_t flags, + xc_cpumap_t ecpumap) { DECLARE_DOMCTL; - DECLARE_HYPERCALL_BUFFER(uint8_t, local); + DECLARE_HYPERCALL_BUFFER(uint8_t, cpumap_local); + DECLARE_HYPERCALL_BUFFER(uint8_t, ecpumap_local); int ret = -1; int cpusize; @@ -206,39 +209,119 @@ int xc_vcpu_setaffinity(xc_interface *xch, goto out; } - local = xc_hypercall_buffer_alloc(xch, local, cpusize); - if ( local == NULL ) + cpumap_local = xc_hypercall_buffer_alloc(xch, cpumap_local, cpusize); + if ( cpumap_local == NULL ) + { + PERROR("Could not allocate cpumap_local for DOMCTL_setvcpuaffinity"); + goto out; + } + ecpumap_local = xc_hypercall_buffer_alloc(xch, ecpumap_local, cpusize); + if ( ecpumap_local == NULL ) { - PERROR("Could not allocate memory for setvcpuaffinity domctl hypercall"); + xc_hypercall_buffer_free(xch, 
cpumap_local); + PERROR("Could not allocate ecpumap_local for DOMCTL_setvcpuaffinity"); goto out; } domctl.cmd = XEN_DOMCTL_setvcpuaffinity; domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - /* Soft affinity is there, but not used anywhere for now, so... */ - domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD; - - memcpy(local, cpumap, cpusize); - - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); + domctl.u.vcpuaffinity.flags = flags; + memcpy(cpumap_local, cpumap, cpusize); + set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, cpumap_local); domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + set_xen_guest_handle(domctl.u.vcpuaffinity.effective_affinity.bitmap, + ecpumap_local); + domctl.u.vcpuaffinity.effective_affinity.nr_bits = cpusize * 8; + ret = do_domctl(xch, &domctl); - xc_hypercall_buffer_free(xch, local); + memcpy(ecpumap, ecpumap_local, cpusize); + + xc_hypercall_buffer_free(xch, cpumap_local); + xc_hypercall_buffer_free(xch, ecpumap_local); out: return ret; } +int xc_vcpu_setaffinity_soft(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + xc_cpumap_t ecpumap) +{ + return _vcpu_setaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_SOFT, + ecpumap); +} -int xc_vcpu_getaffinity(xc_interface *xch, +int xc_vcpu_setaffinity_hard(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + xc_cpumap_t ecpumap) +{ + return _vcpu_setaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_HARD, + ecpumap); +} + +/* Provided for backword compattibility: sets both hard and soft affinity */ +int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, xc_cpumap_t cpumap) { + xc_cpumap_t ecpumap; + int ret = -1; + + ecpumap = xc_cpumap_alloc(xch); + if (ecpumap == NULL) + { + PERROR("Could not allocate memory for DOMCTL_setvcpuaffinity"); + return -1; + } + + ret = _vcpu_setaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_SOFT, + ecpumap); + + if ( ret ) + 
{ + PERROR("Could not set soft affinity via DOMCTL_setvcpuaffinity"); + goto out; + } + + ret = _vcpu_setaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_HARD, + ecpumap); + out: + free(ecpumap); + return ret; +} + + +static int _vcpu_getaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + uint32_t flags) +{ DECLARE_DOMCTL; DECLARE_HYPERCALL_BUFFER(uint8_t, local); int ret = -1; @@ -261,6 +344,7 @@ int xc_vcpu_getaffinity(xc_interface *xch, domctl.cmd = XEN_DOMCTL_getvcpuaffinity; domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; + domctl.u.vcpuaffinity.flags = flags; set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; @@ -274,6 +358,43 @@ out: return ret; } +int xc_vcpu_getaffinity_soft(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap) +{ + return _vcpu_getaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_SOFT); +} + +int xc_vcpu_getaffinity_hard(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap) +{ + return _vcpu_getaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_HARD); +} + +/* Provided for backward compatibility and wired to hard affinity */ +int xc_vcpu_getaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap) +{ + return _vcpu_getaffinity(xch, + domid, + vcpu, + cpumap, + XEN_VCPUAFFINITY_HARD); +} + int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid, unsigned int *guest_width) { diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 4ac6b8a..ec80603 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -579,10 +579,63 @@ int xc_domain_node_getaffinity(xc_interface *xch, uint32_t domind, xc_nodemap_t nodemap); +/** + * This functions specify the scheduling affinity for a vcpu. Soft + * affinity is on what pcpus a vcpu prefers to run. Hard affinity is + * on what pcpus a vcpu is allowed to run. 
When set independently (by + * the respective _soft and _hard calls) the effective affinity is + * also returned. What we call the effective affinity is the intersection + * of soft affinity, hard affinity and the set of the cpus of the cpupool + * the domain belongs to. It's basically what the Xen scheduler will + * actually use. Returning it to the caller allows him to check if that + * matches with, or at least is good enough for, his purposes. + * + * A xc_vcpu_setaffinity() call is provided, mainly for backward + * compatibility reasons, and what it does is setting both hard and + * soft affinity for the vcpu. + * + * @param xch a handle to an open hypervisor interface. + * @param domid the id of the domain to which the vcpu belongs + * @param vcpu the vcpu id within the domain + * @param cpumap the (hard, soft, both) new affinity map one wants to set + * @param ecpumap the effective affinity for the vcpu + */ +int xc_vcpu_setaffinity_soft(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + xc_cpumap_t ecpumap); +int xc_vcpu_setaffinity_hard(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap, + xc_cpumap_t ecpumap); int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, xc_cpumap_t cpumap); + +/** + * These functions retrieve the hard or soft scheduling affinity for + * a vcpu. + * + * A xc_vcpu_getaffinity() call is provided, mainly for backward + * compatibility reasons, and what it does is returning the hard + * affinity, exactly as xc_vcpu_getaffinity_hard(). + * + * @param xch a handle to an open hypervisor interface. 
+ * @param domid the id of the domain to which the vcpu belongs + * @param vcpu the vcpu id wihin the domain + * @param cpumap is where the (hard, soft) affinity is returned + */ +int xc_vcpu_getaffinity_soft(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap); +int xc_vcpu_getaffinity_hard(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_cpumap_t cpumap); int xc_vcpu_getaffinity(xc_interface *xch, uint32_t domid, int vcpu,
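The xenctrl.h comment in this patch defines the effective affinity as the intersection of soft affinity, hard affinity and the cpus of the domain's cpupool, and notes that a caller can compare it with what was requested. A minimal sketch of that idea follows, under stated assumptions: it uses a toy 64-bit mask rather than xc_cpumap_t, it does not call libxc, and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for xc_cpumap_t: bit i set means pcpu i. */
typedef uint64_t toy_cpumap_t;

/* Possible outcomes when comparing a requested mask with the effective one. */
enum toy_affinity_check {
    TOY_AFF_OK,      /* every requested pcpu is actually usable */
    TOY_AFF_PARTIAL, /* some requested pcpus are unreachable */
    TOY_AFF_EMPTY    /* no requested pcpu is reachable at all */
};

/*
 * Effective affinity as described in the header comment: the pcpus that
 * are in the soft affinity, in the hard affinity and in the cpupool.
 */
static toy_cpumap_t toy_effective_affinity(toy_cpumap_t soft,
                                           toy_cpumap_t hard,
                                           toy_cpumap_t pool)
{
    return soft & hard & pool;
}

/*
 * What a caller might do with the returned effective mask: decide
 * whether the request was honoured fully, partially, or not at all.
 */
static enum toy_affinity_check toy_check_requested(toy_cpumap_t requested,
                                                   toy_cpumap_t effective)
{
    assert((effective & ~requested) == 0); /* effective is a subset */

    if (effective == 0)
        return TOY_AFF_EMPTY;
    if (effective != requested)
        return TOY_AFF_PARTIAL;
    return TOY_AFF_OK;
}
```

For instance, requesting soft affinity {0,1} while the hard affinity is {1,2,3} leaves only pcpu 1 effective, so the check reports a partial match. A caller could turn the partial and empty outcomes into warnings rather than errors, since disjoint soft and hard masks are still a legal combination.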
which basically means making space for a new cpumap in both vcpu_info (for getting soft affinity) and build_info (for setting it), along with providing the get/set functions, and wiring them to the proper xc calls.

Interface is as follows:

 * libxl_{get,set}_vcpuaffinity() deals with hard affinity, as it always has;
 * libxl_{get,set}_vcpuaffinity_soft() deals with soft affinity.

*_set_* functions include some logic for checking whether the affinity that would indeed be used matches the one requested by the caller, and printing some warnings if that is not the case.

That is because, despite what the user asks, for instance, for soft affinity, the scheduler only considers running a vCPU where its hard affinity and cpupool mandate. So, although we want to allow any possible combinations (e.g., we do not want to error out if hard affinity and soft affinity are disjoint), we at the very least print some warnings, hoping to help the sysadmin to figure out what is really going on.

Also, as this apparently is the first change being checked in that breaks libxl ABI, bump MAJOR to 4.4.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 tools/libxl/Makefile | 1
 tools/libxl/libxl.c | 206 ++++++++++++++++++++++++++++++++++++++++---
 tools/libxl/libxl.h | 23 +++++
 tools/libxl/libxl_create.c | 6 +
 tools/libxl/libxl_types.idl | 4 +
 tools/libxl/libxl_utils.h | 13 +++
 6 files changed, 239 insertions(+), 14 deletions(-)

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile index cf214bb..b7f39bd 100644 --- a/tools/libxl/Makefile +++ b/tools/libxl/Makefile @@ -5,6 +5,7 @@ XEN_ROOT = $(CURDIR)/../.. 
include $(XEN_ROOT)/tools/Rules.mk +#MAJOR = 4.4 MAJOR = 4.3 MINOR = 0 diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0de1112..d32414d 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4208,12 +4208,24 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); return NULL; } + if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap_soft, 0)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap_soft"); + return NULL; + } if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info"); return NULL; } - if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, ptr->cpumap.map) == -1) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu affinity"); + if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, + ptr->cpumap.map) == -1) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "getting vcpu hard affinity"); + return NULL; + } + if (xc_vcpu_getaffinity_soft(ctx->xch, domid, *nb_vcpu, + ptr->cpumap_soft.map) == -1) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "getting vcpu soft affinity"); return NULL; } ptr->vcpuid = *nb_vcpu; @@ -4226,14 +4238,160 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, return ret; } +static int libxl__set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, const libxl_bitmap *cpumap, + uint32_t flags, libxl_bitmap *ecpumap) +{ + if (flags & XEN_VCPUAFFINITY_HARD) { + if (xc_vcpu_setaffinity_hard(ctx->xch, domid, vcpuid, + cpumap->map, ecpumap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "failed to set hard affinity for %d", + vcpuid); + return ERROR_FAIL; + } + } else if (flags & XEN_VCPUAFFINITY_SOFT) { + if (xc_vcpu_setaffinity_soft(ctx->xch, domid, vcpuid, + cpumap->map, ecpumap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "failed to set soft affinity for %d", + vcpuid); + return ERROR_FAIL; + } + } else + return ERROR_INVAL; + + return 0; +} + 
+static int libxl__get_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap, + uint32_t flags) +{ + if (flags & XEN_VCPUAFFINITY_HARD) { + if (xc_vcpu_getaffinity_hard(ctx->xch, domid, vcpuid, cpumap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "failed to get hard affinity for %d", + vcpuid); + return ERROR_FAIL; + } + } else if (flags & XEN_VCPUAFFINITY_SOFT) { + if (xc_vcpu_getaffinity_soft(ctx->xch, domid, vcpuid, cpumap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, + "failed to get soft affinity for %d", + vcpuid); + return ERROR_FAIL; + } + } else + return ERROR_INVAL; + + return 0; +} + int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap) { - if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map)) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu affinity"); - return ERROR_FAIL; + libxl_bitmap ecpumap, scpumap; + int rc; + + libxl_bitmap_init(&ecpumap); + libxl_bitmap_init(&scpumap); + + rc = libxl_cpu_bitmap_alloc(ctx, &ecpumap, 0); + if (rc) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating ecpumap"); + goto out; } - return 0; + + /* + * Set the new hard affinity and check how it went. If we were + * setting it to "all", and no error occurred, there is no chance + * we are breaking the soft affinity, so we can just leave. + */ + rc = libxl__set_vcpuaffinity(ctx, domid, vcpuid, cpumap, + XEN_VCPUAFFINITY_HARD, &ecpumap); + if (rc || libxl_bitmap_is_full(cpumap)) + goto out; + + /* If not setting "all", let''s figure out what happened. 
*/ + rc = libxl_cpu_bitmap_alloc(ctx, &scpumap, 0); + if (rc) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating scpumap"); + goto out; + } + /* Retrieve the soft affinity to check how it combines with the new hard */ + rc = libxl__get_vcpuaffinity(ctx, domid, vcpuid, &scpumap, + XEN_VCPUAFFINITY_SOFT); + if (rc) + goto out; + + /* + * If the new hard affinity breaks the current soft affinity or, even + * worse, if it makes the interesction of hard and soft affinity empty, + * inform the user about that. Just avoid bothering him in case soft + * affinity is "all", as that means something like "I don''t care much + * about it! anyway." + */ + if (!libxl_bitmap_is_full(&scpumap) && + !libxl_bitmap_equal(&scpumap, &ecpumap)) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "Soft affinity for vcpu %d now contains unreachable cpus", + vcpuid); + if (libxl_bitmap_is_empty(&ecpumap)) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "No reachable cpu in vcpu %d soft affinity. " + "Only hard affinity will be considered for scheduling", + vcpuid); + + out: + libxl_bitmap_dispose(&scpumap); + libxl_bitmap_dispose(&ecpumap); + return rc; +} + +int libxl_set_vcpuaffinity_soft(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap) +{ + libxl_bitmap ecpumap; + int rc; + + libxl_bitmap_init(&ecpumap); + + rc = libxl_cpu_bitmap_alloc(ctx, &ecpumap, 0); + if (rc) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating ecpumap"); + goto out; + } + + /* + * If error, or if setting the soft affinity to "all", we can just + * leave without much other checking, as a full mask already means + * something like "I don''t care much about it!". 
+ */ + rc = libxl__set_vcpuaffinity(ctx, domid, vcpuid, cpumap, + XEN_VCPUAFFINITY_SOFT, &ecpumap); + + if (rc || libxl_bitmap_is_full(cpumap)) + goto out; + + /* + * Check if the soft affinity we just set is something that can actually + * be used by the scheduler or, because of interactions with hard affinity + * and cpupools, that won''t be entirely possible. + */ + if (!libxl_bitmap_equal(cpumap, &ecpumap)) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "Soft affinity for vcpu %d contains unreachable cpus", + vcpuid); + if (libxl_bitmap_is_empty(&ecpumap)) + LIBXL__LOG(ctx, LIBXL__LOG_WARNING, + "No reachable cpu in vcpu %d soft affinity. " + "Only hard affinity will be considered for scheduling", + vcpuid); + + out: + libxl_bitmap_dispose(&ecpumap); + return rc; } int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, @@ -4241,16 +4399,38 @@ int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, { int i, rc = 0; - for (i = 0; i < max_vcpus; i++) { - if (libxl_set_vcpuaffinity(ctx, domid, i, cpumap)) { - LIBXL__LOG(ctx, LIBXL__LOG_WARNING, - "failed to set affinity for %d", i); - rc = ERROR_FAIL; - } - } + for (i = 0; i < max_vcpus; i++) + rc = libxl_set_vcpuaffinity(ctx, domid, i, cpumap); + return rc; } +int libxl_set_vcpuaffinity_all_soft(libxl_ctx *ctx, uint32_t domid, + unsigned int max_vcpus, + libxl_bitmap *cpumap) +{ + int i, rc = 0; + + for (i = 0; i < max_vcpus; i++) + rc = libxl_set_vcpuaffinity_soft(ctx, domid, i, cpumap); + + return rc; +} + +int libxl_get_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap) +{ + return libxl__get_vcpuaffinity(ctx, domid, vcpuid, cpumap, + XEN_VCPUAFFINITY_HARD); +} + +int libxl_get_vcpuaffinity_soft(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap) +{ + return libxl__get_vcpuaffinity(ctx, domid, vcpuid, cpumap, + XEN_VCPUAFFINITY_SOFT); +} + int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap) { diff --git 
a/tools/libxl/libxl.h b/tools/libxl/libxl.h index c7dceda..5020e0d 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -82,6 +82,20 @@ #define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1 /* + * LIBXL_HAVE_VCPUINFO_SOFTAFFINITY indicates that a ''cpumap_soft'' + * field (of libxl_bitmap type) is present in libxl_vcpuinfo, + * containing the soft affinity for the vcpu. + */ +#define LIBXL_HAVE_VCPUINFO_SOFTAFFINITY 1 + +/* + * LIBXL_HAVE_BUILDINFO_SOFTAFFINITY indicates that a ''cpumap_soft'' + * field (of libxl_bitmap type) is present in libxl_domain_build_info, + * containing the soft affinity for the vcpu. + */ +#define LIBXL_HAVE_BUILDINFO_SOFTAFFINITY 1 + +/* * LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE indicates that the * libxl_vendor_device field is present in the hvm sections of * libxl_domain_build_info. This field tells libxl which @@ -971,8 +985,17 @@ int libxl_userdata_retrieve(libxl_ctx *ctx, uint32_t domid, int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo); int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap); +int libxl_set_vcpuaffinity_soft(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_set_vcpuaffinity_all_soft(libxl_ctx *ctx, uint32_t domid, + unsigned int max_vcpus, + libxl_bitmap *cpumap); +int libxl_get_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap); +int libxl_get_vcpuaffinity_soft(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap); int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap); int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index 5e9cdcc..c314bec 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -192,6 +192,12 @@ int 
libxl__domain_build_info_setdefault(libxl__gc *gc, libxl_bitmap_set_any(&b_info->cpumap); } + if (!b_info->cpumap_soft.size) { + if (libxl_cpu_bitmap_alloc(CTX, &b_info->cpumap_soft, 0)) + return ERROR_FAIL; + libxl_bitmap_set_any(&b_info->cpumap_soft); + } + libxl_defbool_setdefault(&b_info->numa_placement, true); if (!b_info->nodemap.size) { diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index de5bac3..4001761 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -297,6 +297,7 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("max_vcpus", integer), ("avail_vcpus", libxl_bitmap), ("cpumap", libxl_bitmap), + ("cpumap_soft", libxl_bitmap), ("nodemap", libxl_bitmap), ("numa_placement", libxl_defbool), ("tsc_mode", libxl_tsc_mode), @@ -509,7 +510,8 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("blocked", bool), ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) - ("cpumap", libxl_bitmap), # current cpu''s affinities + ("cpumap", libxl_bitmap), # current hard cpu affinity + ("cpumap_soft", libxl_bitmap), # current soft cpu affinity ], dir=DIR_OUT) libxl_physinfo = Struct("physinfo", [ diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index 7b84e6a..fef83ca 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -98,6 +98,19 @@ static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) #define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \ if (libxl_bitmap_test(&(m), v)) +static inline int libxl_bitmap_equal(const libxl_bitmap *ba, + const libxl_bitmap *bb) +{ + int i; + + libxl_for_each_bit(i, *ba) { + if (libxl_bitmap_test(ba, i) != libxl_bitmap_test(bb, i)) + return 0; + } + + return 1; +} + static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap, int max_cpus) {
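Editor's note: the warning logic in libxl_set_vcpuaffinity_soft() above can be hard to follow in diff form, so here is a minimal, self-contained sketch of the same idea. It is not libxl code: cpumask_t, mask_all() and check_soft_affinity() are hypothetical names, a 64-bit word stands in for libxl_bitmap, and the effective mask is computed locally as hard & soft, whereas in the patch it comes back from the hypervisor call and, per the comment in the patch, also reflects cpupools.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for libxl_bitmap: one bit per pcpu, at most 64. */
typedef uint64_t cpumask_t;

enum soft_check {
    SOFT_OK,           /* every preferred pcpu is reachable */
    SOFT_PARTIAL,      /* some preferred pcpus lie outside the hard affinity */
    SOFT_UNREACHABLE   /* no preferred pcpu is reachable at all */
};

/* Mask with the lowest nr_cpus bits set, i.e. the "all" affinity. */
static cpumask_t mask_all(unsigned int nr_cpus)
{
    assert(nr_cpus >= 1 && nr_cpus <= 64);
    return nr_cpus == 64 ? UINT64_MAX : (UINT64_C(1) << nr_cpus) - 1;
}

/*
 * Classify how a soft affinity combines with a hard affinity.
 * Assumption: the effective mask is simply hard & soft.
 */
static enum soft_check check_soft_affinity(cpumask_t hard, cpumask_t soft,
                                           unsigned int nr_cpus)
{
    cpumask_t all = mask_all(nr_cpus);
    cpumask_t effective = hard & soft & all;

    soft &= all;
    if (soft == all)
        return SOFT_OK;            /* "all" means no preference: stay quiet */
    if (effective == 0)
        return SOFT_UNREACHABLE;   /* only hard affinity would be considered */
    if (effective != soft)
        return SOFT_PARTIAL;       /* soft affinity contains unreachable cpus */
    return SOFT_OK;
}
```

Unlike this sketch, which returns only the most severe condition, the patch logs the two warnings independently, so an empty intersection triggers both messages there.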
Dario Faggioli
2013-Nov-13 19:12 UTC
[PATCH v2 13/16] xl: show soft affinity in `xl vcpu-list'
if the '-s'/'--soft' option is provided. An example of such output is this:

 # xl vcpu-list -s
 Name       ID  VCPU   CPU State   Time(s) Hard Affinity / Soft Affinity
 Domain-0    0     0    11   -b-       5.4  8-15 / all
 Domain-0    0     1    11   -b-       1.0  8-15 / all
 Domain-0    0    14    13   -b-       1.4  8-15 / all
 Domain-0    0    15     8   -b-       1.6  8-15 / all
 vm-test     3     0     4   -b-       2.5  0-12 / 0-7
 vm-test     3     1     0   -b-       3.2  0-12 / 0-7

xl manual page is updated accordingly.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/man/xl.pod.1         |   10 ++++++
 tools/libxl/xl_cmdimpl.c  |   73 ++++++++++++++++++++++++++++-----------------
 tools/libxl/xl_cmdtable.c |    3 +-
 3 files changed, 57 insertions(+), 29 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index e7b9de2..c41d98f 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -619,6 +619,16 @@ after B<vcpu-set>, go to B<SEE ALSO> section for information.
 Lists VCPU information for a specific domain. If no domain is
 specified, VCPU information for all domains will be provided.
 
+B<OPTIONS>
+
+=over 4
+
+=item B<-s>, B<--soft>
+
+Print the CPU soft affinity of each VCPU too.
+
+=back
+
 =item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus>
 
 Pins the VCPU to only run on the specific CPUs.
The keyword diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index cf237c4..543d19d 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4477,7 +4477,8 @@ int main_button_press(int argc, char **argv) static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, - uint32_t nr_cpus) + uint32_t nr_cpus, + int soft_affinity) { char *domname; @@ -4497,12 +4498,18 @@ static void print_vcpuinfo(uint32_t tdomid, } /* TIM */ printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); - /* CPU AFFINITY */ + /* CPU HARD AND SOFT AFFINITY */ print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); + if (soft_affinity) { + printf(" / "); + print_bitmap(vcpuinfo->cpumap_soft.map, nr_cpus, stdout); + } printf("\n"); } -static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) +static void print_domain_vcpuinfo(uint32_t domid, + uint32_t nr_cpus, + int soft_affinity) { libxl_vcpuinfo *vcpuinfo; int i, nb_vcpu, nrcpus; @@ -4515,55 +4522,65 @@ static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) } for (i = 0; i < nb_vcpu; i++) { - print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus); + print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus, soft_affinity); } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } -static void vcpulist(int argc, char **argv) +int main_vcpulist(int argc, char **argv) { libxl_dominfo *dominfo; libxl_physinfo physinfo; - int i, nb_domain; + int opt, soft_affinity = 0, rc = -1; + static struct option opts[] = { + {"soft", 0, 0, ''s''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; + + SWITCH_FOREACH_OPT(opt, "s", opts, "vcpu-list", 0) { + case ''s'': + soft_affinity = 1; + break; + } if (libxl_get_physinfo(ctx, &physinfo) != 0) { fprintf(stderr, "libxl_physinfo failed.\n"); - goto vcpulist_out; + goto out; } - printf("%-32s %5s %5s %5s %5s %9s %s\n", - "Name", "ID", "VCPU", "CPU", "State", "Time(s)", "CPU Affinity"); - if (!argc) { + printf("%-32s %5s %5s %5s %5s %9s %s", + "Name", "ID", "VCPU", "CPU", "State", "Time(s)", 
"Hard Affinity"); + if (soft_affinity) + printf(" / Soft Affinity"); + printf("\n"); + + if (optind >= argc) { + int i, nb_domain; + if (!(dominfo = libxl_list_domain(ctx, &nb_domain))) { fprintf(stderr, "libxl_list_domain failed.\n"); - goto vcpulist_out; + goto out; } - for (i = 0; i<nb_domain; i++) - print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus); + for (i = 0; i < nb_domain; i++) + print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus, + soft_affinity); libxl_dominfo_list_free(dominfo, nb_domain); } else { - for (; argc > 0; ++argv, --argc) { - uint32_t domid = find_domain(*argv); - print_domain_vcpuinfo(domid, physinfo.nr_cpus); + for (; argc > optind; ++optind) { + uint32_t domid = find_domain(argv[optind]); + print_domain_vcpuinfo(domid, physinfo.nr_cpus, soft_affinity); } } - vcpulist_out: - libxl_physinfo_dispose(&physinfo); -} - -int main_vcpulist(int argc, char **argv) -{ - int opt; + rc = 0; - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) { - /* No options */ - } + out: + libxl_physinfo_dispose(&physinfo); - vcpulist(argc - optind, argv + optind); - return 0; + return rc; } static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index d3dcbf0..4f651e2 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -208,7 +208,8 @@ struct cmd_spec cmd_table[] = { { "vcpu-list", &main_vcpulist, 0, 0, "List the VCPUs for all/some domains", - "[Domain, ...]", + "[option] [Domain, ...]", + "-s, --soft Show CPU soft affinity", }, { "vcpu-pin", &main_vcpupin, 1, 1,
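Editor's note: the "8-15 / all" style of the sample output in this patch can be illustrated with a small range formatter. The sketch below is only an illustration of how such output could be produced; it is not xl's print_bitmap(), whose implementation is not shown in this thread. format_cpumask() is a hypothetical helper, and a 64-bit word is assumed in place of libxl_bitmap.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the lowest nr_cpus bits of mask as a range list such as "0-3,5",
 * or as "all" when every bit is set. An empty mask yields an empty string.
 * Returns buf for convenience.
 */
static const char *format_cpumask(uint64_t mask, unsigned int nr_cpus,
                                  char *buf, size_t len)
{
    size_t off = 0;
    unsigned int cpu = 0;
    uint64_t all;

    assert(buf != NULL && len >= 4);
    assert(nr_cpus >= 1 && nr_cpus <= 64);
    all = nr_cpus == 64 ? UINT64_MAX : (UINT64_C(1) << nr_cpus) - 1;
    mask &= all;

    buf[0] = '\0';
    if (mask == all) {
        snprintf(buf, len, "all");
        return buf;
    }

    while (cpu < nr_cpus && off < len) {
        unsigned int first, last;
        int n;

        if (!(mask & (UINT64_C(1) << cpu))) {
            cpu++;
            continue;
        }
        /* Start of a run of set bits: extend it as far as it goes. */
        first = cpu;
        while (cpu + 1 < nr_cpus && (mask & (UINT64_C(1) << (cpu + 1))))
            cpu++;
        last = cpu++;

        if (first == last)
            n = snprintf(buf + off, len - off, "%s%u",
                         off ? "," : "", first);
        else
            n = snprintf(buf + off, len - off, "%s%u-%u",
                         off ? "," : "", first, last);
        if (n < 0)
            break;
        off += (size_t)n;
    }
    return buf;
}
```

Printing a hard mask, then " / ", then a soft mask with such a helper would give a column in the same shape as the one shown in the commit message.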
Also by adding a '-s'/'--soft' switch to `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/man/xl.pod.1         |   24 ++++++++++++++++-
 tools/libxl/xl_cmdimpl.c  |   65 +++++++++++++++++++++++++++------------------
 tools/libxl/xl_cmdtable.c |    3 +-
 3 files changed, 64 insertions(+), 28 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index c41d98f..e16c331 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -629,7 +629,7 @@ Print the CPU soft affinity of each VCPU too.
 
 =back
 
-=item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus>
+=item B<vcpu-pin> [I<OPTIONS>] I<domain-id> I<vcpu> I<cpus>
 
 Pins the VCPU to only run on the specific CPUs. The keyword
 B<all> can be used to apply the I<cpus> list to all VCPUs in the
@@ -640,6 +640,28 @@ different run state is appropriate. Pinning can be used to restrict
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+B<OPTIONS>
+
+=over 4
+
+=item B<-s>, B<--soft>
+
+The same as above, but affect I<soft affinity> rather than pinning
+(also called I<hard affinity>).
+
+Normally, VCPUs just wander among the CPUs where it is allowed to
+run (either all the CPUs or the ones to which it is pinned, as said
+for B<vcpu-list>). Soft affinity offer a mean to specify one or more
+I<preferred> CPUs. Basically, among the ones where it can run, the
+VCPU the VCPU will greately prefer to execute on one of these CPUs,
+whenever that is possible.
+
+Notice that, in order for soft affinity to actually work, it needs
+special support in the scheduler. Right now, only credit1 provides
+that.
+
+=back
+
 =item B<vm-list>
 
 Prints information about guests.
This list excludes information about diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 543d19d..5a66d63 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4583,17 +4583,33 @@ int main_vcpulist(int argc, char **argv) return rc; } -static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) +int main_vcpupin(int argc, char **argv) { libxl_vcpuinfo *vcpuinfo; libxl_bitmap cpumap; - - uint32_t vcpuid; - char *endptr; + uint32_t vcpuid, domid; + const char *vcpu; + char *endptr, *cpu; int i = 0, nb_vcpu, rc = -1; + int opt, soft_affinity = 0; + static struct option opts[] = { + {"soft", 0, 0, ''s''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; libxl_bitmap_init(&cpumap); + SWITCH_FOREACH_OPT(opt, "s", opts, "vcpu-pin", 3) { + case ''s'': + soft_affinity = 1; + break; + } + + domid = find_domain(argv[optind]); + vcpu = argv[optind+1]; + cpu = argv[optind+2]; + vcpuid = strtoul(vcpu, &endptr, 10); if (vcpu == endptr) { if (strcmp(vcpu, "all")) { @@ -4632,22 +4648,30 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) } if (vcpuid != -1) { - if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) { + int ret; + + if (soft_affinity) + ret = libxl_set_vcpuaffinity_soft(ctx, domid, vcpuid, &cpumap); + else + ret = libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap); + if (ret) fprintf(stderr, "Could not set affinity for vcpu `%u''.\n", vcpuid); - } - } - else { + } else { + int ret; + if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) { fprintf(stderr, "libxl_list_vcpu failed.\n"); goto out; } - for (i = 0; i < nb_vcpu; i++) { - if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid, - &cpumap) == -1) { - fprintf(stderr, "libxl_set_vcpuaffinity failed" - " on vcpu `%u''.\n", vcpuinfo[i].vcpuid); - } - } + + if (soft_affinity) + ret = libxl_set_vcpuaffinity_all_soft(ctx, domid, nb_vcpu, + &cpumap); + else + ret = libxl_set_vcpuaffinity_all(ctx, domid, nb_vcpu, &cpumap); + if (ret) + fprintf(stderr, 
"libxl_set_vcpuaffinity failed on vcpu `%u''.\n", + vcpuinfo[i].vcpuid); libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } @@ -4657,17 +4681,6 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) return rc; } -int main_vcpupin(int argc, char **argv) -{ - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-pin", 3) { - /* No options */ - } - - return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); -} - static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) { char *endptr; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index 4f651e2..c6c7590 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -214,7 +214,8 @@ struct cmd_spec cmd_table[] = { { "vcpu-pin", &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", - "<Domain> <VCPU|all> <CPUs|all>", + "[option] <Domain> <VCPU|all> <CPUs|all>", + "-s, --soft Deal with soft affinity", }, { "vcpu-set", &main_vcpuset, 0, 1,
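Editor's note: main_vcpupin() above accepts either a vcpu number or the keyword "all" as its second positional argument. The following is a minimal sketch of that parsing step in isolation, under stated assumptions: parse_vcpu_arg() and VCPU_ALL are hypothetical names (the patch itself uses -1 as the "all" marker), and this version is deliberately stricter than what is visible in the diff, since it also rejects trailing junk and out-of-range values.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical marker meaning "apply to every vcpu of the domain". */
#define VCPU_ALL UINT32_MAX

/*
 * Parse a vcpu argument: either "all" or a decimal vcpu number.
 * Returns 0 and fills *vcpuid on success, -1 on malformed input.
 */
static int parse_vcpu_arg(const char *arg, uint32_t *vcpuid)
{
    char *endptr;
    unsigned long val;

    assert(arg != NULL && vcpuid != NULL);

    if (strcmp(arg, "all") == 0) {
        *vcpuid = VCPU_ALL;
        return 0;
    }

    errno = 0;
    val = strtoul(arg, &endptr, 10);
    /* No digits, trailing junk, overflow, or a clash with the marker. */
    if (endptr == arg || *endptr != '\0' || errno == ERANGE ||
        val >= UINT32_MAX)
        return -1;

    *vcpuid = (uint32_t)val;
    return 0;
}
```

A caller would then pick libxl_set_vcpuaffinity_soft() or libxl_set_vcpuaffinity() for a single vcpu, or the corresponding _all variants when the marker is returned, depending on whether '-s' was given.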
Dario Faggioli
2013-Nov-13 19:13 UTC
[PATCH v2 15/16] xl: enable for specifying node-affinity in the config file
in a similar way to how it is possible to specify vcpu-affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.cfg.pod.5 | 27 ++++++++++++++-- tools/libxl/xl_cmdimpl.c | 78 +++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 100 insertions(+), 5 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index 5dbc73c..733c74e 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -144,19 +144,40 @@ run on cpu #3 of the host. =back If this option is not specified, no vcpu to cpu pinning is established, -and the vcpus of the guest can run on all the cpus of the host. +and the vcpus of the guest can run on all the cpus of the host. If this +option is specified, the intersection of the vcpu pinning mask, provided +here, and the soft affinity mask, provided via B<cpus\_soft=> (if any), +is utilized to compute the domain node-affinity, for driving memory +allocations. If we are on a NUMA machine (i.e., if the host has more than one NUMA node) and this option is not specified, libxl automatically tries to place the guest on the least possible number of nodes. That, however, will not affect vcpu pinning, so the guest will still be able to run on -all the cpus, it will just prefer the ones from the node it has been -placed on. A heuristic approach is used for choosing the best node (or +all the cpus. A heuristic approach is used for choosing the best node (or set of nodes), with the goals of maximizing performance for the guest and, at the same time, achieving efficient utilization of host cpus and memory. See F<docs/misc/xl-numa-placement.markdown> for more details. +=item B<cpus_soft="CPU-LIST"> + +Exactly as B<cpus=>, but specifies soft affinity, rather than pinning +(also called hard affinity). Starting from Xen 4.4, and if the credit +scheduler is used, this means the vcpus of the domain prefers to run +these pcpus. 
Default is either all pcpus or xl (via libxl) guesses +(depending on what other options are present). + +A C<CPU-LIST> is specified exactly as above, for B<cpus=>. + +If this option is not specified, the vcpus of the guest will not have +any preference regarding on what cpu to run, and the scheduler will +treat all the cpus where a vcpu can execute (if B<cpus=> is specified), +or all the host cpus (if not), the same. If this option is specified, +the intersection of the soft affinity mask, provided here, and the vcpu +pinning, provided via B<cpus=> (if any), is utilized to compute the +domain node-affinity, for driving memory allocations. + =back =head3 CPU Scheduling diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 5a66d63..b773679 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -76,8 +76,9 @@ xlchild children[child_max]; static const char *common_domname; static int fd_lock = -1; -/* Stash for specific vcpu to pcpu mappping */ +/* Stash for specific vcpu to pcpu hard and soft mappping */ static int *vcpu_to_pcpu; +static int *vcpu_to_pcpu_soft; static const char savefileheader_magic[32] "Xen saved domain, xl format\n \0 \r"; @@ -647,7 +648,8 @@ static void parse_config_data(const char *config_source, const char *buf; long l; XLU_Config *config; - XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; + XLU_ConfigList *cpus, *cpus_soft, *vbds, *nics, *pcis; + XLU_ConfigList *cvfbs, *cpuids, *vtpms; XLU_ConfigList *ioports, *irqs, *iomem; int num_ioports, num_irqs, num_iomem; int pci_power_mgmt = 0; @@ -824,6 +826,50 @@ static void parse_config_data(const char *config_source, libxl_defbool_set(&b_info->numa_placement, false); } + if (!xlu_cfg_get_list (config, "cpus_soft", &cpus_soft, 0, 1)) { + int n_cpus = 0; + + if (libxl_node_bitmap_alloc(ctx, &b_info->cpumap_soft, 0)) { + fprintf(stderr, "Unable to allocate cpumap_soft\n"); + exit(1); + } + + /* As above, use a temporary storage for the single 
affinities */ + vcpu_to_pcpu_soft = xmalloc(sizeof(int) * b_info->max_vcpus); + memset(vcpu_to_pcpu_soft, -1, sizeof(int) * b_info->max_vcpus); + + libxl_bitmap_set_none(&b_info->cpumap_soft); + while ((buf = xlu_cfg_get_listitem(cpus_soft, n_cpus)) != NULL) { + i = atoi(buf); + if (!libxl_bitmap_cpu_valid(&b_info->cpumap_soft, i)) { + fprintf(stderr, "cpu %d illegal\n", i); + exit(1); + } + libxl_bitmap_set(&b_info->cpumap_soft, i); + if (n_cpus < b_info->max_vcpus) + vcpu_to_pcpu_soft[n_cpus] = i; + n_cpus++; + } + + /* We have a soft affinity map, disable automatic placement */ + libxl_defbool_set(&b_info->numa_placement, false); + } + else if (!xlu_cfg_get_string (config, "cpus_soft", &buf, 0)) { + char *buf2 = strdup(buf); + + if (libxl_node_bitmap_alloc(ctx, &b_info->cpumap_soft, 0)) { + fprintf(stderr, "Unable to allocate cpumap_soft\n"); + exit(1); + } + + libxl_bitmap_set_none(&b_info->cpumap_soft); + if (vcpupin_parse(buf2, &b_info->cpumap_soft)) + exit(1); + free(buf2); + + libxl_defbool_set(&b_info->numa_placement, false); + } + if (!xlu_cfg_get_long (config, "memory", &l, 0)) { b_info->max_memkb = l * 1024; b_info->target_memkb = b_info->max_memkb; @@ -2183,6 +2229,34 @@ start: free(vcpu_to_pcpu); vcpu_to_pcpu = NULL; } + /* And do the same for single vcpu to soft-affinity mapping */ + if (vcpu_to_pcpu_soft) { + libxl_bitmap soft_cpumap; + + ret = libxl_cpu_bitmap_alloc(ctx, &soft_cpumap, 0); + if (ret) + goto error_out; + for (i = 0; i < d_config.b_info.max_vcpus; i++) { + + if (vcpu_to_pcpu_soft[i] != -1) { + libxl_bitmap_set_none(&soft_cpumap); + libxl_bitmap_set(&soft_cpumap, vcpu_to_pcpu_soft[i]); + } else { + libxl_bitmap_set_any(&soft_cpumap); + } + if (libxl_set_vcpuaffinity_soft(ctx, domid, i, &soft_cpumap)) { + fprintf(stderr, "setting soft-affinity failed " + "on vcpu `%d''.\n", i); + libxl_bitmap_dispose(&soft_cpumap); + free(vcpu_to_pcpu_soft); + ret = ERROR_FAIL; + goto error_out; + } + } + libxl_bitmap_dispose(&soft_cpumap); + 
free(vcpu_to_pcpu_soft); vcpu_to_pcpu_soft = NULL; + } + ret = libxl_userdata_store(ctx, domid, "xl", config_data, config_len); if (ret) {
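Editor's note: when "cpus_soft" is given as a list, the code above treats the n-th entry as the preferred pcpu of the n-th vcpu, and leaves vcpus without an entry with no preference. A minimal sketch of that mapping follows. It is not the xl code: soft_masks_from_list() is a hypothetical helper, and 64-bit words are assumed in place of libxl_bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * pcpus[]      entries of a cpus_soft list, in order
 * vcpu_masks[] receives one soft-affinity mask per vcpu
 * domain_mask  receives the union of all listed pcpus
 * Returns 0 on success, -1 if an entry is not a valid pcpu.
 */
static int soft_masks_from_list(const int *pcpus, size_t n_pcpus,
                                unsigned int nr_cpus,
                                uint64_t *vcpu_masks, size_t max_vcpus,
                                uint64_t *domain_mask)
{
    uint64_t all;
    size_t i;

    assert(vcpu_masks != NULL && domain_mask != NULL);
    assert(nr_cpus >= 1 && nr_cpus <= 64);
    all = nr_cpus == 64 ? UINT64_MAX : (UINT64_C(1) << nr_cpus) - 1;

    *domain_mask = 0;
    for (i = 0; i < max_vcpus; i++)
        vcpu_masks[i] = all;            /* no list entry: no preference */

    for (i = 0; i < n_pcpus; i++) {
        if (pcpus[i] < 0 || (unsigned int)pcpus[i] >= nr_cpus)
            return -1;                  /* reported as "cpu %d illegal" in xl */
        *domain_mask |= UINT64_C(1) << pcpus[i];
        if (i < max_vcpus)
            vcpu_masks[i] = UINT64_C(1) << pcpus[i];
    }
    return 0;
}
```

With a list of [2, 3] on an 8-pcpu host and a 4-vcpu guest, vcpu 0 prefers pcpu 2, vcpu 1 prefers pcpu 3, and vcpus 2 and 3 keep the full mask.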
Dario Faggioli
2013-Nov-13 19:13 UTC
[PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
vCPU soft affinity and NUMA-aware scheduling do not have to be related. However, soft affinity is how NUMA-aware scheduling is actually implemented, and therefore, by default, the results of automatic NUMA placement (at VM creation time) are also used to set the soft affinity of all the vCPUs of the domain.

Of course, this only happens if automatic NUMA placement is enabled and actually takes place (for instance, if the user does not specify any hard or soft affinity in the xl config file).

This also takes care of the converse, i.e., don't trigger automatic placement if the config file specifies either a hard (the check for which was already there) or a soft (the check for which is introduced by this commit) affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/man/xl.cfg.pod.5                |   21 +++++++++++----------
 docs/misc/xl-numa-placement.markdown |   16 ++++++++++++++--
 tools/libxl/libxl_dom.c              |   22 ++++++++++++++++++++--
 3 files changed, 45 insertions(+), 14 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 733c74e..d4a0a6f 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -150,16 +150,6 @@ here, and the soft affinity mask, provided via B<cpus\_soft=> (if any),
 is utilized to compute the domain node-affinity, for driving memory
 allocations.
 
-If we are on a NUMA machine (i.e., if the host has more than one NUMA
-node) and this option is not specified, libxl automatically tries to
-place the guest on the least possible number of nodes. That, however,
-will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus. A heuristic approach is used for choosing the best node (or
-set of nodes), with the goals of maximizing performance for the guest
-and, at the same time, achieving efficient utilization of host cpus
-and memory. See F<docs/misc/xl-numa-placement.markdown> for more
-details.
- =item B<cpus_soft="CPU-LIST"> Exactly as B<cpus=>, but specifies soft affinity, rather than pinning @@ -178,6 +168,17 @@ the intersection of the soft affinity mask, provided here, and the vcpu pinning, provided via B<cpus=> (if any), is utilized to compute the domain node-affinity, for driving memory allocations. +If this option is not specified (and B<cpus=> is not specified either), +libxl automatically tries to place the guest on the least possible +number of nodes. A heuristic approach is used for choosing the best +node (or set of nodes), with the goal of maximizing performance for +the guest and, at the same time, achieving efficient utilization of +host cpus and memory. In that case, the soft affinity of all the vcpus +of the domain will be set to the pcpus belonging to the NUMA nodes +chosen during placement. + +For more details, see F<docs/misc/xl-numa-placement.markdown>. + =back =head3 CPU Scheduling diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown index b1ed361..f644758 100644 --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -126,10 +126,22 @@ or Xen won''t be able to guarantee the locality for their memory accesses. That, of course, also mean the vCPUs of the domain will only be able to execute on those same pCPUs. +Starting from 4.4, is is also possible to specify a "cpus\_soft=" option +in the xl config file. This, independently from whether or not "cpus=" is +specified too, affect the NUMA placement in a way very similar to what +is described above. In fact, the hypervisor will build up the node-affinity +of the VM basing right on it or, if both pinning (via "cpus=") and soft +affinity (via "cpus\_soft=") are present, basing on their intersection. + +Besides that, "cpus\_soft=" also means, of course, that the vCPUs of the +domain will prefer to execute on, among the pCPUs where they can run, +those particular pCPUs. 
+ + ### Placing the guest automatically ### -If no "cpus=" option is specified in the config file, libxl tries -to figure out on its own on which node(s) the domain could fit best. +If neither "cpus=" nor "cpus\_soft=" are present in the config file, libxl +tries to figure out on its own on which node(s) the domain could fit best. If it finds one (some), the domain''s node affinity get set to there, and both memory allocations and NUMA aware scheduling (for the credit scheduler and starting from Xen 4.3) will comply with it. Starting from diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index a1c16b0..d241efc 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -222,21 +222,39 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, * some weird error manifests) the subsequent call to * libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. + * + * As far as scheduling is concerned, we achieve NUMA-aware scheduling + * by having the results of placement affect the soft affinity of all + * the vcpus of the domain. Of course, we want that iff placement is + * enabled and actually happens, so we only change info->cpumap_soft to + * reflect the placement result if that is the case */ if (libxl_defbool_val(info->numa_placement)) { - if (!libxl_bitmap_is_full(&info->cpumap)) { + /* We require both hard and soft affinity not to be set */ + if (!libxl_bitmap_is_full(&info->cpumap) || + !libxl_bitmap_is_full(&info->cpumap_soft)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " - "affinity is specified"); + "(hard or soft) affinity is specified"); return ERROR_INVAL; } rc = numa_place_domain(gc, domid, info); if (rc) return rc; + + /* + * We change the soft affinity in domain_build_info here, of course + * after converting the result of placement from nodes to cpus. the + * following call to libxl_set_vcpuaffinity_all_soft() will do the + * actual updating of the domain''s vcpus'' soft affinity. 
+ */ + libxl_nodemap_to_cpumap(ctx, &info->nodemap, &info->cpumap_soft); } libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + libxl_set_vcpuaffinity_all_soft(ctx, domid, info->max_vcpus, + &info->cpumap_soft); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL);
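Editor's note: the key step in this patch is turning the placement result (a set of NUMA nodes) into a soft affinity (a set of pcpus) via libxl_nodemap_to_cpumap(). The sketch below shows the idea only and is not libxl's implementation: nodemask_to_cpumask() and the cpu_to_node[] topology array are hypothetical, and 64-bit words are assumed in place of libxl_bitmap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the mask of pcpus that belong to any node set in nodemask.
 * cpu_to_node[c] is the NUMA node that pcpu c sits on.
 */
static uint64_t nodemask_to_cpumask(uint64_t nodemask,
                                    const unsigned int *cpu_to_node,
                                    size_t nr_cpus)
{
    uint64_t cpumask = 0;
    size_t cpu;

    assert(cpu_to_node != NULL && nr_cpus <= 64);
    for (cpu = 0; cpu < nr_cpus; cpu++) {
        assert(cpu_to_node[cpu] < 64);
        if (nodemask & (UINT64_C(1) << cpu_to_node[cpu]))
            cpumask |= UINT64_C(1) << cpu;
    }
    return cpumask;
}
```

On a hypothetical host with pcpus 0-3 on node 0, 4-7 on node 1 and 8-11 on node 2, placing a domain on nodes 0 and 2 would give a soft affinity of pcpus 0-3 and 8-11, while hard affinity stays at "all".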
Dario Faggioli
2013-Nov-13 19:16 UTC
Re: [PATCH v2 12/16] libxl: get and set soft affinity
On Wed, 2013-11-13 at 20:12 +0100, Dario Faggioli wrote:
>
> Also, as this apparently is the first change being checked in
> that breaks libxl ABI, bump MAJOR to 4.4.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  tools/libxl/Makefile        |    1
>  tools/libxl/libxl.c         |  206 ++++++++++++++++++++++++++++++++++++++++---
>  tools/libxl/libxl.h         |   23 +++++
>  tools/libxl/libxl_create.c  |    6 +
>  tools/libxl/libxl_types.idl |    4 +
>  tools/libxl/libxl_utils.h   |   13 +++
>  6 files changed, 239 insertions(+), 14 deletions(-)
>
> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> index cf214bb..b7f39bd 100644
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -5,6 +5,7 @@
>  XEN_ROOT = $(CURDIR)/../..
>  include $(XEN_ROOT)/tools/Rules.mk
>
> +#MAJOR = 4.4
>  MAJOR = 4.3
>  MINOR = 0
>
And, of course, this is quite wrong! :-( Sorry for that.

Correct version with this line _not_ commented out attached.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
George Dunlap
2013-Nov-14 10:50 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
On 13/11/13 19:11, Dario Faggioli wrote:
> in fact, pinning to all the pcpus happens by specifying "all"
> (either on the command line or in the config file), while `xl
> vcpu-list' report it as "any cpu".
>
> Change this into something more consistent, by using "all"
> everywhere.

While I can see a certain logic to having the specification and the output match, there is a part of me that really prefers the "any" syntax in the output.

This is just expressing a preference, not a nack; if others think it would be better to be consistent, I won't argue the case.

 -George
George Dunlap
2013-Nov-14 11:02 UTC
Re: [PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning
On 13/11/13 19:11, Dario Faggioli wrote:
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
>  * ...
>
> In both domain config file and `xl vcpu-pin'.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
> Changes from v1 (of this very series):
>  * actually checking for both "nodes:" and "node:" as per
>    the doc says;
>  * using strcmp() (rather than strncmp()) when matching
>    "all", to avoid returning success on any longer string
>    that just begins with "all";
>  * fixing the handling (well, the rejection, actually) of
>    "^all" and "^nodes:all";
>  * make some string pointers const.
>
> Changes from v2 (of original series):
>  * turned a 'return' into 'goto out', consistently with the
>    most of exit patterns;
>  * harmonized error handling: now parse_range() return a
>    libxl error code, as requested during review;
>  * dealing with "all" moved inside update_cpumap_range().
>    It's tricky to move it in parse_range() (as requested
>    during review), since we need the cpumap being modified
>    handy when dealing with it. However, having it in
>    update_cpumap_range() simplifies the code just as much
>    as that;
>  * explicitly checking for junk after a valid value or range
>    in parse_range(), as requested during review;
>  * xl exits on parsing failing, so no need to reset the
>    cpumap to something sensible in vcpupin_parse(), as
>    suggested during review;
>
> Changes from v1 (of original series):
>  * code rearranged in order to look more simple to follow
>    and understand, as requested during review;
>  * improved docs in xl.cfg.pod.5, as requested during
>    review;
>  * strtoul() now returns into unsigned long, and the
>    case where it returns ULONG_MAX is now taken into
>    account, as requested during review;
>  * stuff like "all,^7" now works, as requested during
>    review. Specifying just "^7" does not work either
>    before or after this change
>  * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's'`)
>    by introducing STR_SKIP_PREFIX() macro, as requested
>    during review.
> ---
>  docs/man/xl.cfg.pod.5    |   20 ++++++
>  tools/libxl/xl_cmdimpl.c |  153 +++++++++++++++++++++++++++++++++-------------
>  2 files changed, 128 insertions(+), 45 deletions(-)
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index e6fc83f..5dbc73c 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
>
>  =item "0-3,5,^1"
>
> -To allow all the vcpus of the guest to run on cpus 0,2,3,5.
> +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
> +this with "all" is possible, meaning "all,^7" results in all the vcpus
> +of the guest running on all the cpus on the host except cpu 7.
> +
> +=item "nodes:0-3,node:^2"
> +
> +To allow all the vcpus of the guest to run on the cpus from NUMA nodes
> +0,1,3 of the host. So, if cpus 0-3 belongs to node 0, cpus 4-7 belongs
> +to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus
> +of the guest will run on cpus 0-3,8-11.
> +
> +Combining this notation with the one above is possible. For instance,
> +"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
> +on all the cpus of NUMA node 2, but not on cpu 6. Following the same
> +example as above, that would be cpus 1,4,5,7.
> +
> +Combining this with "all" is also possible, meaning "all,^nodes:1"
> +results in all the vcpus of the guest running on all the cpus on the
> +host, except for the cpus belonging to the host NUMA node 1.
>
>  =item ["2", "3"] (or [2, 3])
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 13e97b3..5f5cc43 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -59,6 +59,11 @@
>          } \
>      })
>
> +#define STR_HAS_PREFIX( a, b ) \
> +    ( strncmp(a, b, strlen(b)) == 0 )
> +#define STR_SKIP_PREFIX( a, b ) \
> +    ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )
> +
>
>  int logfile = 2;
>
> @@ -513,61 +518,121 @@ static void split_string_into_string_list(const char *str,
>      free(s);
>  }
>
> -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
>  {
> -    libxl_bitmap exclude_cpumap;
> -    uint32_t cpuida, cpuidb;
> -    char *endptr, *toka, *tokb, *saveptr = NULL;
> -    int i, rc = 0, rmcpu;
> +    const char *nstr;
> +    char *endptr;
>
> -    if (!strcmp(cpu, "all")) {
> -        libxl_bitmap_set_any(cpumap);
> -        return 0;
> +    *a = *b = strtoul(str, &endptr, 10);
> +    if (endptr == str || *a == ULONG_MAX)
> +        return ERROR_INVAL;
> +
> +    if (*endptr == '-') {
> +        nstr = endptr + 1;
> +
> +        *b = strtoul(nstr, &endptr, 10);
> +        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
> +            return ERROR_INVAL;
> +    }
> +
> +    /* Valid value or range so far, but we also don't want junk after that */
> +    if (*endptr != '\0')
> +        return ERROR_INVAL;
> +
> +    return 0;
> +}
> +
> +/*
> + * Add or removes a specific set of cpus (specified in str, either as
> + * single cpus or as entire NUMA nodes) to/from cpumap.
> + */
> +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> +{
> +    unsigned long ida, idb;
> +    libxl_bitmap node_cpumap;
> +    bool is_not = false, is_nodes = false;
> +    int rc = 0;
> +
> +    libxl_bitmap_init(&node_cpumap);
> +
> +    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
> +    if (rc) {
> +        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
> +        goto out;
>      }
>
> -    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
> -        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
> -        return ENOMEM;
> +    /* Are we adding or removing cpus/nodes? */
> +    if (STR_SKIP_PREFIX(str, "^")) {
> +        is_not = true;
>      }
>
> -    for (toka = strtok_r(cpu, ",", &saveptr); toka;
> -         toka = strtok_r(NULL, ",", &saveptr)) {
> -        rmcpu = 0;
> -        if (*toka == '^') {
> -            /* This (These) Cpu(s) will be removed from the map */
> -            toka++;
> -            rmcpu = 1;
> -        }
> -        /* Extract a valid (range of) cpu(s) */
> -        cpuida = cpuidb = strtoul(toka, &endptr, 10);
> -        if (endptr == toka) {
> -            fprintf(stderr, "Error: Invalid argument.\n");
> -            rc = EINVAL;
> -            goto vcpp_out;
> -        }
> -        if (*endptr == '-') {
> -            tokb = endptr + 1;
> -            cpuidb = strtoul(tokb, &endptr, 10);
> -            if (endptr == tokb || cpuida > cpuidb) {
> -                fprintf(stderr, "Error: Invalid argument.\n");
> -                rc = EINVAL;
> -                goto vcpp_out;
> +    /* Are we dealing with cpus or full nodes? */
> +    if (STR_SKIP_PREFIX(str, "node:") || STR_SKIP_PREFIX(str, "nodes:")) {
> +        is_nodes = true;
> +    }
> +
> +    if (strcmp(str, "all") == 0) {
> +        /* We do not accept "^all" or "^nodes:all" */
> +        if (is_not) {
> +            fprintf(stderr, "Can't combine \"^\" and \"all\".\n");
> +            rc = ERROR_INVAL;
> +        } else
> +            libxl_bitmap_set_any(cpumap);
> +        goto out;
> +    }
> +
> +    rc = parse_range(str, &ida, &idb);
> +    if (rc) {
> +        fprintf(stderr, "Invalid pcpu range: %s.\n", str);
> +        goto out;
> +    }
> +
> +    /* Add or remove the specified cpus in the range */
> +    while (ida <= idb) {
> +        if (is_nodes) {
> +            /* Add/Remove all the cpus of a NUMA node */
> +            int i;
> +
> +            rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap);
> +            if (rc) {
> +                fprintf(stderr, "libxl_node_to_cpumap failed.\n");
> +                goto out;
>              }
> +
> +            /* Add/Remove all the cpus in the node cpumap */
> +            libxl_for_each_set_bit(i, node_cpumap) {
> +                is_not ? libxl_bitmap_reset(cpumap, i) :
> +                         libxl_bitmap_set(cpumap, i);
> +            }
> +        } else {
> +            /* Add/Remove this cpu */
> +            is_not ? libxl_bitmap_reset(cpumap, ida) :
> +                     libxl_bitmap_set(cpumap, ida);
>          }
> -        while (cpuida <= cpuidb) {
> -            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
> -                         libxl_bitmap_set(&exclude_cpumap, cpuida);
> -            cpuida++;
> -        }
> +        ida++;
>      }
>
> -    /* Clear all the cpus from the removal list */
> -    libxl_for_each_set_bit(i, exclude_cpumap) {
> -        libxl_bitmap_reset(cpumap, i);
> -    }
> + out:
> +    libxl_bitmap_dispose(&node_cpumap);
> +    return rc;
> +}
>
> -vcpp_out:
> -    libxl_bitmap_dispose(&exclude_cpumap);
> +/*
> + * Takes a string representing a set of cpus (specified either as
> + * single cpus or as eintire NUMA nodes) and turns it into the
> + * corresponding libxl_bitmap (in cpumap).
> + */
> +static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +{
> +    char *ptr, *saveptr = NULL;
> +    int rc = 0;
> +
> +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> +         ptr = strtok_r(NULL, ",", &saveptr)) {
> +        rc = update_cpumap_range(ptr, cpumap);
> +        if (rc)
> +            break;
> +    }
>
>      return rc;
>  }
>
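For readers who want to try the range-parsing logic outside of xl, here is a minimal standalone sketch of what parse_range() in the quoted patch does. It is not the patch's code: the name parse_range_sketch and the plain -1 error return (in place of libxl's ERROR_INVAL) are assumptions made for illustration. The leading-digit checks are also an addition that is not in the patch; strtoul() skips leading whitespace and accepts a sign, so a token such as "-2" would otherwise parse as a very large value.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical standalone variant of parse_range(): parses "N" or "N-M"
 * into [*a, *b]. Returns 0 on success, -1 on any malformed input.
 */
static int parse_range_sketch(const char *str,
                              unsigned long *a, unsigned long *b)
{
    const char *nstr;
    char *endptr;

    assert(str && a && b);

    /* Addition over the patch: insist on a digit, so that the sign and
     * whitespace handling of strtoul() cannot be triggered. */
    if (!isdigit((unsigned char)*str))
        return -1;

    *a = *b = strtoul(str, &endptr, 10);
    if (endptr == str || *a == ULONG_MAX)
        return -1;

    if (*endptr == '-') {
        nstr = endptr + 1;
        if (!isdigit((unsigned char)*nstr))
            return -1;

        *b = strtoul(nstr, &endptr, 10);
        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
            return -1;
    }

    /* A valid value or range so far; reject trailing junk such as "4x". */
    if (*endptr != '\0')
        return -1;

    return 0;
}
```

As in the patch, a single value yields a == b, so the caller can always loop from a to b inclusive.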
George Dunlap
2013-Nov-14 11:11 UTC
Re: [PATCH v2 05/16] xen: fix leaking of v->cpu_affinity_saved
On 13/11/13 19:11, Dario Faggioli wrote:
> on domain destruction.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Ugh, nasty. Good catch.

Looks like this was introduced in 41e71c26 back in April, so this fix should be backported to 4.3.

Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Dario Faggioli
2013-Nov-14 11:11 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
On Thu, 2013-11-14 at 10:50 +0000, George Dunlap wrote:
> On 13/11/13 19:11, Dario Faggioli wrote:
> > in fact, pinning to all the pcpus happens by specifying "all"
> > (either on the command line or in the config file), while `xl
> > vcpu-list' report it as "any cpu".
> >
> > Change this into something more consistent, by using "all"
> > everywhere.
>
> While I can see a certain logic to having the specification and the
> output match, there is a part of me that really prefers the "any" syntax
> in the output.
>
I see.

> This is just expressing a preference, not a nack; if others think it
> would be better to be consistent, I won't argue the case.
>
I'm fine with both. As I tried (and probably failed) to say here:

"
Changes since v1:
 * this patch was not there in v1. It is now as using the
   same syntax for both input and output was requested
   during review.
"

It was a request (from Ian Jackson) during review of v1:

> -George

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-14 11:13 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
On Thu, 2013-11-14 at 10:50 +0000, George Dunlap wrote:
> On 13/11/13 19:11, Dario Faggioli wrote:
> > in fact, pinning to all the pcpus happens by specifying "all"
> > (either on the command line or in the config file), while `xl
> > vcpu-list' report it as "any cpu".
> >
> > Change this into something more consistent, by using "all"
> > everywhere.
>
> While I can see a certain logic to having the specification and the
> output match, there is a part of me that really prefers the "any" syntax
> in the output.
>
I see.

> This is just expressing a preference, not a nack; if others think it
> would be better to be consistent, I won't argue the case.
>
I'm fine with both. As I tried (and probably failed) to say here:

"
Changes since v1:
 * this patch was not there in v1. It is now as using the
   same syntax for both input and output was requested
   during review.
"

It was a request (from Ian Jackson) during last round:
http://bugs.xenproject.org/xen/mid/%3C21117.191.14394.791489@mariner.uk.xensource.com%3E

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2013-Nov-14 11:14 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
On 14/11/13 11:11, Dario Faggioli wrote:
> On Thu, 2013-11-14 at 10:50 +0000, George Dunlap wrote:
>> On 13/11/13 19:11, Dario Faggioli wrote:
>>> in fact, pinning to all the pcpus happens by specifying "all"
>>> (either on the command line or in the config file), while `xl
>>> vcpu-list' report it as "any cpu".
>>>
>>> Change this into something more consistent, by using "all"
>>> everywhere.
>> While I can see a certain logic to having the specification and the
>> output match, there is a part of me that really prefers the "any" syntax
>> in the output.
>>
> I see.
>
>> This is just expressing a preference, not a nack; if others think it
>> would be better to be consistent, I won't argue the case.
>>
> I'm fine with both. As I tried (and probably failed) to say here:
>
> "
> Changes since v1:
>  * this patch was not there in v1. It is now as using the
>    same syntax for both input and output was requested
>    during review.
> "
>
> It was a request (from Ian Jackson) during review of v1:

Oh, right -- sorry, missed that comment (and the discussion w/ IanJ).

In that case,

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

-George
Dario Faggioli
2013-Nov-14 11:58 UTC
Re: [PATCH v2 05/16] xen: fix leaking of v->cpu_affinity_saved
On Thu, 2013-11-14 at 11:11 +0000, George Dunlap wrote:
> On 13/11/13 19:11, Dario Faggioli wrote:
> > on domain destruction.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Ugh, nasty. Good catch.
>
:-)

> Looks like this was introduced in 41e71c26 back in April, so this fix
> should be backported to 4.3.
>
Oh, right... I really forgot to try figuring out what introduced this and what that was. I'll try to remember at least to file a proper backport request, as soon as this hits the tree.

Thanks and regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2013-Nov-14 12:44 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
George Dunlap writes ("Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax"):
> On 13/11/13 19:11, Dario Faggioli wrote:
> > in fact, pinning to all the pcpus happens by specifying "all"
> > (either on the command line or in the config file), while `xl
> > vcpu-list' report it as "any cpu".
> >
> > Change this into something more consistent, by using "all"
> > everywhere.
>
> While I can see a certain logic to having the specification and the
> output match, there is a part of me that really prefers the "any" syntax
> in the output.

I definitely think they should be consistent. I don't mind whether it's "any" or "all".

(I see Dario has explained that he did this at my request.)

Thanks,
Ian.
George Dunlap
2013-Nov-14 14:17 UTC
Re: [PATCH v2 07/16] xen: sched: rename v->cpu_affinity into v->cpu_hard_affinity
On 13/11/13 19:12, Dario Faggioli wrote:
> in order to distinguish it from the cpu_soft_affinity introduced
> by the previous commit ("xen: sched: make space for
> cpu_soft_affinity"). This patch does not imply any functional
> change, it is basically the result of something like the following:
>
>  s/cpu_affinity/cpu_hard_affinity/g
>  s/cpu_affinity_tmp/cpu_hard_affinity_tmp/g
>  s/cpu_affinity_saved/cpu_hard_affinity_saved/g
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

> ---
>  xen/arch/x86/traps.c      |   11 ++++++-----
>  xen/common/domain.c       |   22 +++++++++++-----------
>  xen/common/domctl.c       |    2 +-
>  xen/common/keyhandler.c   |    2 +-
>  xen/common/sched_credit.c |   12 ++++++------
>  xen/common/sched_sedf.c   |    2 +-
>  xen/common/schedule.c     |   21 +++++++++++----------
>  xen/common/wait.c         |    4 ++--
>  xen/include/xen/sched.h   |    8 ++++----
>  9 files changed, 43 insertions(+), 41 deletions(-)
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index e5b3585..4279cad 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -3083,7 +3083,8 @@ static void nmi_mce_softirq(void)
>
>      /* Set the tmp value unconditionally, so that
>       * the check in the iret hypercall works. */
> -    cpumask_copy(st->vcpu->cpu_affinity_tmp, st->vcpu->cpu_affinity);
> +    cpumask_copy(st->vcpu->cpu_hard_affinity_tmp,
> +                 st->vcpu->cpu_hard_affinity);
>
>      if ((cpu != st->processor)
>          || (st->processor != st->vcpu->processor))
> @@ -3118,11 +3119,11 @@ void async_exception_cleanup(struct vcpu *curr)
>          return;
>
>      /* Restore affinity. */
> -    if ( !cpumask_empty(curr->cpu_affinity_tmp) &&
> -         !cpumask_equal(curr->cpu_affinity_tmp, curr->cpu_affinity) )
> +    if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) &&
> +         !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) )
>      {
> -        vcpu_set_affinity(curr, curr->cpu_affinity_tmp);
> -        cpumask_clear(curr->cpu_affinity_tmp);
> +        vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp);
> +        cpumask_clear(curr->cpu_hard_affinity_tmp);
>      }
>
>      if ( !(curr->async_exception_mask & (curr->async_exception_mask - 1)) )
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index c33b876..2916490 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -125,9 +125,9 @@ struct vcpu *alloc_vcpu(
>
>      tasklet_init(&v->continue_hypercall_tasklet, NULL, 0);
>
> -    if ( !zalloc_cpumask_var(&v->cpu_affinity) ||
> -         !zalloc_cpumask_var(&v->cpu_affinity_tmp) ||
> -         !zalloc_cpumask_var(&v->cpu_affinity_saved) ||
> +    if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) ||
> +         !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) ||
> +         !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) ||
>           !zalloc_cpumask_var(&v->cpu_soft_affinity) ||
>           !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) )
>          goto fail_free;
> @@ -157,9 +157,9 @@ struct vcpu *alloc_vcpu(
>   fail_wq:
>      destroy_waitqueue_vcpu(v);
>   fail_free:
> -    free_cpumask_var(v->cpu_affinity);
> -    free_cpumask_var(v->cpu_affinity_tmp);
> -    free_cpumask_var(v->cpu_affinity_saved);
> +    free_cpumask_var(v->cpu_hard_affinity);
> +    free_cpumask_var(v->cpu_hard_affinity_tmp);
> +    free_cpumask_var(v->cpu_hard_affinity_saved);
>      free_cpumask_var(v->cpu_soft_affinity);
>      free_cpumask_var(v->vcpu_dirty_cpumask);
>      free_vcpu_struct(v);
> @@ -373,7 +373,7 @@ void domain_update_node_affinity(struct domain *d)
>
>      for_each_vcpu ( d, v )
>      {
> -        cpumask_and(online_affinity, v->cpu_affinity, online);
> +        cpumask_and(online_affinity, v->cpu_hard_affinity, online);
>          cpumask_or(cpumask, cpumask, online_affinity);
>      }
>
> @@ -736,9 +736,9 @@ static void complete_domain_destroy(struct rcu_head *head)
>      for ( i = d->max_vcpus - 1; i >= 0; i-- )
>          if ( (v = d->vcpu[i]) != NULL )
>          {
> -            free_cpumask_var(v->cpu_affinity);
> -            free_cpumask_var(v->cpu_affinity_tmp);
> -            free_cpumask_var(v->cpu_affinity_saved);
> +            free_cpumask_var(v->cpu_hard_affinity);
> +            free_cpumask_var(v->cpu_hard_affinity_tmp);
> +            free_cpumask_var(v->cpu_hard_affinity_saved);
>              free_cpumask_var(v->cpu_soft_affinity);
>              free_cpumask_var(v->vcpu_dirty_cpumask);
>              free_vcpu_struct(v);
> @@ -878,7 +878,7 @@ int vcpu_reset(struct vcpu *v)
>      v->async_exception_mask = 0;
>      memset(v->async_exception_state, 0, sizeof(v->async_exception_state));
>  #endif
> -    cpumask_clear(v->cpu_affinity_tmp);
> +    cpumask_clear(v->cpu_hard_affinity_tmp);
>      clear_bit(_VPF_blocked, &v->pause_flags);
>      clear_bit(_VPF_in_reset, &v->pause_flags);
>
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 904d27b..5e0ac5c 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -629,7 +629,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>          else
>          {
>              ret = cpumask_to_xenctl_bitmap(
> -                &op->u.vcpuaffinity.cpumap, v->cpu_affinity);
> +                &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
>          }
>      }
>      break;
> diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
> index 33c9a37..42fb418 100644
> --- a/xen/common/keyhandler.c
> +++ b/xen/common/keyhandler.c
> @@ -296,7 +296,7 @@ static void dump_domains(unsigned char key)
>                     !vcpu_event_delivery_is_enabled(v));
>              cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask);
>              printk("dirty_cpus=%s ", tmpstr);
> -            cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity);
> +            cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_hard_affinity);
>              printk("cpu_affinity=%s\n", tmpstr);
>              cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity);
>              printk("cpu_soft_affinity=%s\n", tmpstr);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 28dafcf..398b095 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -332,13 +332,13 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
>      if ( step == CSCHED_BALANCE_NODE_AFFINITY )
>      {
>          cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
> -                    vc->cpu_affinity);
> +                    vc->cpu_hard_affinity);
>
>          if ( unlikely(cpumask_empty(mask)) )
> -            cpumask_copy(mask, vc->cpu_affinity);
> +            cpumask_copy(mask, vc->cpu_hard_affinity);
>      }
>      else /* step == CSCHED_BALANCE_CPU_AFFINITY */
> -        cpumask_copy(mask, vc->cpu_affinity);
> +        cpumask_copy(mask, vc->cpu_hard_affinity);
>  }
>
>  static void burn_credits(struct csched_vcpu *svc, s_time_t now)
> @@ -407,7 +407,7 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
>
>              if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
>                   && !__vcpu_has_node_affinity(new->vcpu,
> -                                              new->vcpu->cpu_affinity) )
> +                                              new->vcpu->cpu_hard_affinity) )
>                  continue;
>
>              /* Are there idlers suitable for new (for this balance step)? */
> @@ -642,7 +642,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>
>      /* Store in cpus the mask of online cpus on which the domain can run */
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> -    cpumask_and(&cpus, vc->cpu_affinity, online);
> +    cpumask_and(&cpus, vc->cpu_hard_affinity, online);
>
>      for_each_csched_balance_step( balance_step )
>      {
> @@ -1487,7 +1487,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
>               * or counter.
>               */
>              if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
> -                 && !__vcpu_has_node_affinity(vc, vc->cpu_affinity) )
> +                 && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) )
>                  continue;
>
>              csched_balance_cpumask(vc, balance_step, csched_balance_mask);
> diff --git a/xen/common/sched_sedf.c b/xen/common/sched_sedf.c
> index 7c24171..c219aed 100644
> --- a/xen/common/sched_sedf.c
> +++ b/xen/common/sched_sedf.c
> @@ -396,7 +396,7 @@ static int sedf_pick_cpu(const struct scheduler *ops, struct vcpu *v)
>      cpumask_t *online;
>
>      online = cpupool_scheduler_cpumask(v->domain->cpupool);
> -    cpumask_and(&online_affinity, v->cpu_affinity, online);
> +    cpumask_and(&online_affinity, v->cpu_hard_affinity, online);
>      return cpumask_cycle(v->vcpu_id % cpumask_weight(&online_affinity) - 1,
>                           &online_affinity);
>  }
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 5731622..28099d6 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -194,9 +194,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor)
>       */
>      v->processor = processor;
>      if ( is_idle_domain(d) || d->is_pinned )
> -        cpumask_copy(v->cpu_affinity, cpumask_of(processor));
> +        cpumask_copy(v->cpu_hard_affinity, cpumask_of(processor));
>      else
> -        cpumask_setall(v->cpu_affinity);
> +        cpumask_setall(v->cpu_hard_affinity);
>
>      cpumask_setall(v->cpu_soft_affinity);
>
> @@ -287,7 +287,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
>          migrate_timer(&v->singleshot_timer, new_p);
>          migrate_timer(&v->poll_timer, new_p);
>
> -        cpumask_setall(v->cpu_affinity);
> +        cpumask_setall(v->cpu_hard_affinity);
>
>          lock = vcpu_schedule_lock_irq(v);
>          v->processor = new_p;
> @@ -459,7 +459,7 @@ static void vcpu_migrate(struct vcpu *v)
>               */
>              if ( pick_called &&
>                   (new_lock == per_cpu(schedule_data, new_cpu).schedule_lock) &&
> -                 cpumask_test_cpu(new_cpu, v->cpu_affinity) &&
> +                 cpumask_test_cpu(new_cpu, v->cpu_hard_affinity) &&
>                   cpumask_test_cpu(new_cpu, v->domain->cpupool->cpu_valid) )
>                  break;
>
> @@ -563,7 +563,7 @@ void restore_vcpu_affinity(struct domain *d)
>          {
>              printk(XENLOG_DEBUG "Restoring affinity for d%dv%d\n",
>                     d->domain_id, v->vcpu_id);
> -            cpumask_copy(v->cpu_affinity, v->cpu_affinity_saved);
> +            cpumask_copy(v->cpu_hard_affinity, v->cpu_hard_affinity_saved);
>              v->affinity_broken = 0;
>          }
>
> @@ -606,20 +606,21 @@ int cpu_disable_scheduler(unsigned int cpu)
>              unsigned long flags;
>              spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
>
> -            cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
> +            cpumask_and(&online_affinity, v->cpu_hard_affinity, c->cpu_valid);
>              if ( cpumask_empty(&online_affinity) &&
> -                 cpumask_test_cpu(cpu, v->cpu_affinity) )
> +                 cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
>              {
>                  printk(XENLOG_DEBUG "Breaking affinity for d%dv%d\n",
>                         d->domain_id, v->vcpu_id);
>
>                  if (system_state == SYS_STATE_suspend)
>                  {
> -                    cpumask_copy(v->cpu_affinity_saved, v->cpu_affinity);
> +                    cpumask_copy(v->cpu_hard_affinity_saved,
> +                                 v->cpu_hard_affinity);
>                      v->affinity_broken = 1;
>                  }
>
> -                cpumask_setall(v->cpu_affinity);
> +                cpumask_setall(v->cpu_hard_affinity);
>              }
>
>              if ( v->processor == cpu )
> @@ -667,7 +668,7 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
>
>      lock = vcpu_schedule_lock_irq(v);
>
> -    cpumask_copy(v->cpu_affinity, affinity);
> +    cpumask_copy(v->cpu_hard_affinity, affinity);
>
>      /* Always ask the scheduler to re-evaluate placement
>       * when changing the affinity */
> diff --git a/xen/common/wait.c b/xen/common/wait.c
> index 3c9366c..3f6ff41 100644
> --- a/xen/common/wait.c
> +++ b/xen/common/wait.c
> @@ -134,7 +134,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
>
>      /* Save current VCPU affinity; force wakeup on *this* CPU only. */
>      wqv->wakeup_cpu = smp_processor_id();
> -    cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity);
> +    cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity);
>      if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) )
>      {
>          gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n");
> @@ -183,7 +183,7 @@ void check_wakeup_from_wait(void)
>      {
>          /* Re-set VCPU affinity and re-enter the scheduler. */
>          struct vcpu *curr = current;
> -        cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity);
> +        cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity);
>          if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) )
>          {
>              gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n");
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> index 7e00caf..3575312 100644
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -192,11 +192,11 @@ struct vcpu
>      spinlock_t       virq_lock;
>
>      /* Bitmask of CPUs on which this VCPU may run. */
> -    cpumask_var_t    cpu_affinity;
> +    cpumask_var_t    cpu_hard_affinity;
>      /* Used to change affinity temporarily. */
> -    cpumask_var_t    cpu_affinity_tmp;
> +    cpumask_var_t    cpu_hard_affinity_tmp;
>      /* Used to restore affinity across S3. */
> -    cpumask_var_t    cpu_affinity_saved;
> +    cpumask_var_t    cpu_hard_affinity_saved;
>
>      /* Bitmask of CPUs on which this VCPU prefers to run. */
>      cpumask_var_t    cpu_soft_affinity;
> @@ -795,7 +795,7 @@ void watchdog_domain_destroy(struct domain *d);
>  #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
>  #define has_hvm_container_vcpu(v)   (has_hvm_container_domain((v)->domain))
>  #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
> -                           cpumask_weight((v)->cpu_affinity) == 1)
> +                           cpumask_weight((v)->cpu_hard_affinity) == 1)
>  #ifdef HAS_PASSTHROUGH
>  #define need_iommu(d)    ((d)->need_iommu)
>  #else
>
Ian Jackson
2013-Nov-14 14:19 UTC
Re: [PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax
Dario Faggioli writes ("[PATCH v2 01/16] xl: match output of vcpu-list with pinning syntax"):
> in fact, pinning to all the pcpus happens by specifying "all"
> (either on the command line or in the config file), while `xl
> vcpu-list' report it as "any cpu".
>
> Change this into something more consistent, by using "all"
> everywhere.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson
2013-Nov-14 14:24 UTC
Re: [PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning
Dario Faggioli writes ("[PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning"):
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
...
>  * fixing the handling (well, the rejection, actually) of
>    "^all" and "^nodes:all";

If it were me I would have made ^all and ^nodes:all work, rather than rejecting them. That makes it possible for semi-automatically-generated configs to specify "blah,blah,^all,1-3" to override the first part.

But that's not really important.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Ian Jackson
2013-Nov-14 14:25 UTC
Re: [PATCH v2 05/16] xen: fix leaking of v->cpu_affinity_saved
Dario Faggioli writes ("[PATCH v2 05/16] xen: fix leaking of v->cpu_affinity_saved"):
> on domain destruction.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
Dario Faggioli
2013-Nov-14 14:37 UTC
Re: [PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning
On Thu, 2013-11-14 at 14:24 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v2 02/16] xl: allow for node-wise specification of vcpu pinning"):
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUS of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
> >    but not pCPU 6;
> ...
> >  * fixing the handling (well, the rejection, actually) of
> >    "^all" and "^nodes:all";
>
> If it were me I would have made ^all and ^nodes:all work, rather than
> rejecting them. That makes it possible for
> semi-automatically-generated configs to specify
>    "blah,blah,^all,1-3"
> to override the first part.
>
Oh, I see... No, sorry, I completely failed to spot the possibility of enabling such a use case. Now that you mention it, I see it exists and could be useful, although probably not very common.

> But that's not really important.
>
Yeah, I think so. Anyway, it's nothing that can't be done with a follow-up patch, once this series is in (/me adds it to my TODO list. :-) )

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
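To make the use case Ian describes concrete, here is a rough sketch of a list parser in which "^all" is accepted and simply clears everything accumulated so far, so that "0-7,^all,1-3" ends up meaning just cpus 1-3. This is not the follow-up patch mentioned above, only an illustration under stated assumptions: it uses a plain 64-bit mask instead of libxl_bitmap, it ignores the "node:"/"nodes:" prefixes, and the names SKETCH_NR_CPUS, sketch_apply_token and sketch_parse_cpulist are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_NR_CPUS 64u  /* assumption: host cpus fit in one 64-bit mask */

/* Apply one token ("all", "^all", "N", "N-M", "^N", "^N-M") to *map. */
static int sketch_apply_token(const char *tok, size_t len, uint64_t *map)
{
    char buf[32], *end;
    const char *p = buf;
    unsigned long lo, hi;
    int is_not = 0;

    if (len == 0 || len >= sizeof(buf))
        return -1;
    memcpy(buf, tok, len);
    buf[len] = '\0';

    if (*p == '^') {
        is_not = 1;
        p++;
    }

    /* Unlike the patch, "^all" is not rejected: it empties the map. */
    if (strcmp(p, "all") == 0) {
        *map = is_not ? 0 : UINT64_MAX;
        return 0;
    }

    if (!isdigit((unsigned char)*p))
        return -1;
    lo = hi = strtoul(p, &end, 10);
    if (*end == '-') {
        p = end + 1;
        if (!isdigit((unsigned char)*p))
            return -1;
        hi = strtoul(p, &end, 10);
    }
    if (*end != '\0' || hi < lo || hi >= SKETCH_NR_CPUS)
        return -1;

    for (; lo <= hi; lo++) {
        if (is_not)
            *map &= ~(UINT64_C(1) << lo);
        else
            *map |= UINT64_C(1) << lo;
    }
    return 0;
}

/* Process a comma-separated list left to right, starting from an empty map. */
static int sketch_parse_cpulist(const char *spec, uint64_t *map)
{
    assert(spec && map);

    *map = 0;
    while (*spec) {
        size_t len = strcspn(spec, ",");

        if (sketch_apply_token(spec, len, map))
            return -1;
        spec += len;
        if (*spec == ',')
            spec++;
    }
    return 0;
}
```

Because tokens are applied strictly left to right, a generated prefix followed by "^all" and an override behaves the way Ian suggests, while "all,^7" keeps the meaning documented in the patch.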
George Dunlap
2013-Nov-14 14:42 UTC
Re: [PATCH v2 09/16] xen: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On 13/11/13 19:12, Dario Faggioli wrote:
> by adding a flag for the caller to specify which one he cares about.
>
> Add also another cpumap there. This way, in case of
> DOMCTL_setvcpuaffinity, Xen can return back to the caller the
> "effective affinity" of the vcpu. We call the effective affinity
> the intersection between cpupool's cpus, the new hard affinity
> and (if asking to set soft affinity) the new soft affinity.
>
> The purpose of this is allowing the toolstack to figure out whether
> or not the requested change in affinity produced sensible results,
> when combined with the other settings that are already in place.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  tools/libxc/xc_domain.c     |    4 +++-
>  xen/arch/x86/traps.c        |    4 ++--
>  xen/common/domctl.c         |   38 ++++++++++++++++++++++++++++++++++----
>  xen/common/schedule.c       |   35 ++++++++++++++++++++++++-----------
>  xen/common/wait.c           |    6 +++---
>  xen/include/public/domctl.h |   15 +++++++++++++--
>  xen/include/xen/sched.h     |    3 ++-
>  7 files changed, 81 insertions(+), 24 deletions(-)
>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index 1ccafc5..f9ae4bf 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -215,7 +215,9 @@ int xc_vcpu_setaffinity(xc_interface *xch,
>
>      domctl.cmd = XEN_DOMCTL_setvcpuaffinity;
>      domctl.domain = (domid_t)domid;
> -    domctl.u.vcpuaffinity.vcpu = vcpu;
> +    domctl.u.vcpuaffinity.vcpu = vcpu;
> +    /* Soft affinity is there, but not used anywhere for now, so... */
> +    domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD;
>
>      memcpy(local, cpumap, cpusize);
>
> diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
> index 4279cad..196ff68 100644
> --- a/xen/arch/x86/traps.c
> +++ b/xen/arch/x86/traps.c
> @@ -3093,7 +3093,7 @@ static void nmi_mce_softirq(void)
>           * Make sure to wakeup the vcpu on the
>           * specified processor.
>           */
> -        vcpu_set_affinity(st->vcpu, cpumask_of(st->processor));
> +        vcpu_set_hard_affinity(st->vcpu, cpumask_of(st->processor));
>
>          /* Affinity is restored in the iret hypercall. */
>      }
> @@ -3122,7 +3122,7 @@ void async_exception_cleanup(struct vcpu *curr)
>      if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) &&
>           !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) )
>      {
> -        vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp);
> +        vcpu_set_hard_affinity(curr, curr->cpu_hard_affinity_tmp);
>          cpumask_clear(curr->cpu_hard_affinity_tmp);
>      }
>
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 5e0ac5c..b3f779e 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -617,19 +617,49 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>          if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
>          {
>              cpumask_var_t new_affinity;
> +            cpumask_t *online;
>
>              ret = xenctl_bitmap_to_cpumask(
>                  &new_affinity, &op->u.vcpuaffinity.cpumap);
> -            if ( !ret )
> +            if ( ret )
> +                break;
> +
> +            ret = -EINVAL;
> +            if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD )
> +                ret = vcpu_set_hard_affinity(v, new_affinity);
> +            else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT )
> +                ret = vcpu_set_soft_affinity(v, new_affinity);

Wait, why are we making this a bitmask of flags, if we can only set one at a time? Shouldn't it instead be a simple enum?

Or alternately (which is what I was expecting when I saw it was 'flags'), shouldn't it allow you to set both at the same time? (i.e., take away the 'else' here?)

> +
> +            if ( ret )
>              {
> -                ret = vcpu_set_affinity(v, new_affinity);
>                  free_cpumask_var(new_affinity);
> +                break;
>              }
> +
> +            /*
> +             * Report back to the caller what the "effective affinity", that
> +             * is the intersection of cpupool's pcpus, the (new?) hard
> +             * affinity and the (new?) soft-affinity.
> +             */
> +            online = cpupool_online_cpumask(v->domain->cpupool);
> +            cpumask_and(new_affinity, online, v->cpu_hard_affinity);
> +            cpumask_and(new_affinity, new_affinity , v->cpu_soft_affinity);
> +
> +            ret = cpumask_to_xenctl_bitmap(
> +                &op->u.vcpuaffinity.effective_affinity, new_affinity);
> +
> +            free_cpumask_var(new_affinity);
>          }
>          else
>          {
> -            ret = cpumask_to_xenctl_bitmap(
> -                &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
> +            if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD )
> +                ret = cpumask_to_xenctl_bitmap(
> +                    &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
> +            else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT )
> +                ret = cpumask_to_xenctl_bitmap(
> +                    &op->u.vcpuaffinity.cpumap, v->cpu_soft_affinity);
> +            else
> +                ret = -EINVAL;

Asking for both the hard and soft affinities at the same time shouldn't return just the hard affinity. It should either return an error, or do something interesting like return the intersection of the two.

Other than that, I think this looks good.

>          }
>      }
>      break;
> diff --git a/xen/common/schedule.c b/xen/common/schedule.c
> index 28099d6..bb52366 100644
> --- a/xen/common/schedule.c
> +++ b/xen/common/schedule.c
> @@ -653,22 +653,14 @@ void sched_set_node_affinity(struct domain *d, nodemask_t *mask)
>      SCHED_OP(DOM2OP(d), set_node_affinity, d, mask);
>  }
>
> -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
> +static int vcpu_set_affinity(
> +    struct vcpu *v, const cpumask_t *affinity, cpumask_t **which)
>  {
> -    cpumask_t online_affinity;
> -    cpumask_t *online;
>      spinlock_t *lock;
>
> -    if ( v->domain->is_pinned )
> -        return -EINVAL;
> -    online = VCPU2ONLINE(v);
> -    cpumask_and(&online_affinity, affinity, online);
> -    if ( cpumask_empty(&online_affinity) )
> -        return -EINVAL;
> -
>      lock = vcpu_schedule_lock_irq(v);
>
> -    cpumask_copy(v->cpu_hard_affinity, affinity);
> +    cpumask_copy(*which, affinity);
>
>      /* Always ask the scheduler to re-evaluate placement
>       * when changing the affinity */
> @@ -687,6 +679,27 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) > return 0; > } > > +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity) > +{ > + cpumask_t online_affinity; > + cpumask_t *online; > + > + if ( v->domain->is_pinned ) > + return -EINVAL; > + > + online = VCPU2ONLINE(v); > + cpumask_and(&online_affinity, affinity, online); > + if ( cpumask_empty(&online_affinity) ) > + return -EINVAL; > + > + return vcpu_set_affinity(v, affinity, &v->cpu_hard_affinity); > +} > + > +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity) > +{ > + return vcpu_set_affinity(v, affinity, &v->cpu_soft_affinity); > +} > + > /* Block the currently-executing domain until a pertinent event occurs. */ > void vcpu_block(void) > { > diff --git a/xen/common/wait.c b/xen/common/wait.c > index 3f6ff41..1f6b597 100644 > --- a/xen/common/wait.c > +++ b/xen/common/wait.c > @@ -135,7 +135,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) > /* Save current VCPU affinity; force wakeup on *this* CPU only. */ > wqv->wakeup_cpu = smp_processor_id(); > cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); > - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) > + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) > { > gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); > domain_crash_synchronous(); > @@ -166,7 +166,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) > static void __finish_wait(struct waitqueue_vcpu *wqv) > { > wqv->esp = NULL; > - (void)vcpu_set_affinity(current, &wqv->saved_affinity); > + (void)vcpu_set_hard_affinity(current, &wqv->saved_affinity); > } > > void check_wakeup_from_wait(void) > @@ -184,7 +184,7 @@ void check_wakeup_from_wait(void) > /* Re-set VCPU affinity and re-enter the scheduler. 
*/ > struct vcpu *curr = current; > cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); > - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) > + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) > { > gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); > domain_crash_synchronous(); > diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h > index 01a3652..aed9cd4 100644 > --- a/xen/include/public/domctl.h > +++ b/xen/include/public/domctl.h > @@ -300,8 +300,19 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); > /* XEN_DOMCTL_setvcpuaffinity */ > /* XEN_DOMCTL_getvcpuaffinity */ > struct xen_domctl_vcpuaffinity { > - uint32_t vcpu; /* IN */ > - struct xenctl_bitmap cpumap; /* IN/OUT */ > + /* IN variables. */ > + uint32_t vcpu; > + /* Set/get the hard affinity for vcpu */ > +#define _XEN_VCPUAFFINITY_HARD 0 > +#define XEN_VCPUAFFINITY_HARD (1U<<_XEN_VCPUAFFINITY_HARD) > + /* Set/get the soft affinity for vcpu */ > +#define _XEN_VCPUAFFINITY_SOFT 1 > +#define XEN_VCPUAFFINITY_SOFT (1U<<_XEN_VCPUAFFINITY_SOFT) > + uint32_t flags; > + /* IN/OUT variables. */ > + struct xenctl_bitmap cpumap; > + /* OUT variables. 
*/ > + struct xenctl_bitmap effective_affinity; > }; > typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; > DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index 3575312..0f728b3 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -755,7 +755,8 @@ void scheduler_free(struct scheduler *sched); > int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); > void vcpu_force_reschedule(struct vcpu *v); > int cpu_disable_scheduler(unsigned int cpu); > -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); > +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity); > +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity); > void restore_vcpu_affinity(struct domain *d); > > void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate); >
Ian Jackson
2013-Nov-14 14:58 UTC
Re: [PATCH v2 11/16] libxc: get and set soft and hard affinity
Dario Faggioli writes ("[PATCH v2 11/16] libxc: get and set soft and hard affinity"):
> by using the new flag introduced in the parameters of
> DOMCTL_{get,set}_vcpuaffinity.
...
> -int xc_vcpu_setaffinity(xc_interface *xch,
> -                        uint32_t domid,
> -                        int vcpu,
> -                        xc_cpumap_t cpumap)
> +static int _vcpu_setaffinity(xc_interface *xch,
> +                             uint32_t domid,
> +                             int vcpu,
> +                             xc_cpumap_t cpumap,
> +                             uint32_t flags,
> +                             xc_cpumap_t ecpumap)

It would be clearer if the ecpumap parameter was clearly marked as an out parameter. Even better if the in parameter was const-correct.

However, defining the xc_cpumap_t to be a uint8_t* has made that difficult. We don't want to introduce a new xc_cpumap_const_t. Perhaps we should have typedef uint8_t xc_cpumap_element.

(NB that identifiers ending *_t are reserved for the C implementation. libxc gets this wrong all the time :-/.)

> @@ -206,39 +209,119 @@ int xc_vcpu_setaffinity(xc_interface *xch,
>          goto out;
>      }
>
> -    local = xc_hypercall_buffer_alloc(xch, local, cpusize);
> -    if ( local == NULL )
> +    cpumap_local = xc_hypercall_buffer_alloc(xch, cpumap_local, cpusize);

Is xc_hypercall_buffer_free idempotent, and is there a way to init a hypercall buffer to an unallocated state? If so this function could be a lot simpler, and it could in particular more clearly not leak anything, by using the "goto out" cleanup style.

> +/**
> + * This functions specify the scheduling affinity for a vcpu. Soft
> + * affinity is on what pcpus a vcpu prefers to run. Hard affinity is
> + * on what pcpus a vcpu is allowed to run. When set independently (by
> + * the respective _soft and _hard calls) the effective affinity is
> + * also returned. What we call the effective affinity it the intersection
> + * of soft affinity, hard affinity and the set of the cpus of the cpupool
> + * the domain belongs to. It's basically what the Xen scheduler will
> + * actually use. Returning it to the caller allows him to check if that
> + * matches with, or at least is good enough for, his purposes.
> + *
> + * A xc_vcpu_setaffinity() call is provided, mainly for backward
> + * compatibility reasons, and what it does is setting both hard and
> + * soft affinity for the vcpu.
> + *
> + * @param xch a handle to an open hypervisor interface.
> + * @param domid the id of the domain to which the vcpu belongs
> + * @param vcpu the vcpu id wihin the domain
> + * @param cpumap the (hard, soft, both) new affinity map one wants to set
> + * @param ecpumap the effective affinity for the vcpu

Either the doc comment, or the parameter name, should make it clear that ecpumap is an out parameter.

Ian.
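For reference, the "goto out" cleanup style and the in/out parameter marking asked for above might look roughly like the sketch below. It is only an illustration under stated assumptions: plain malloc()/free() stand in for the hypercall buffer calls (free(NULL) is a no-op in standard C, which is the property the question about xc_hypercall_buffer_free is after), and `toy_hypercall()` and `toy_setaffinity()` are hypothetical names, not libxc functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the hypercall: only pcpus 0-3 are online. */
static int toy_hypercall(const uint8_t *req, uint8_t *eff, size_t n)
{
    for (size_t i = 0; i < n; i++)
        eff[i] = req[i] & (i == 0 ? 0x0F : 0x00);
    return 0;
}

/*
 * cpumap      (in):  requested affinity, never modified (hence const).
 * ecpumap_out (out): effective affinity, written only on success.
 */
static int toy_setaffinity(const uint8_t *cpumap, uint8_t *ecpumap_out,
                           size_t cpusize)
{
    /* Start in the "unallocated" state so 'out' can free unconditionally. */
    uint8_t *cpumap_local = NULL;
    uint8_t *ecpumap_local = NULL;
    int ret = -1;

    assert(cpumap != NULL && ecpumap_out != NULL);

    if (cpusize == 0)
        goto out;

    cpumap_local = malloc(cpusize);
    if (cpumap_local == NULL)
        goto out;

    ecpumap_local = malloc(cpusize);
    if (ecpumap_local == NULL)
        goto out;

    memcpy(cpumap_local, cpumap, cpusize);

    ret = toy_hypercall(cpumap_local, ecpumap_local, cpusize);
    if (ret == 0)
        memcpy(ecpumap_out, ecpumap_local, cpusize);

 out:
    /* Single exit path: free(NULL) is harmless, so nothing can leak. */
    free(ecpumap_local);
    free(cpumap_local);
    return ret;
}
```

Whether the real hypercall buffer API allows the same pattern depends on the answer to the question above; the sketch only shows why that answer matters.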
George Dunlap
2013-Nov-14 15:03 UTC
Re: [PATCH v2 06/16] xen: sched: make space for cpu_soft_affinity
On 13/11/13 19:11, Dario Faggioli wrote:> Before this change, each vcpu had its own vcpu-affinity > (in v->cpu_affinity), representing the set of pcpus where > the vcpu is allowed to run. Since when NUMA-aware scheduling > was introduced the (credit1 only, for now) scheduler also > tries as much as it can to run all the vcpus of a domain > on one of the nodes that constitutes the domain''s > node-affinity. > > The idea here is making the mechanism more general by: > * allowing for this ''preference'' for some pcpus/nodes to be > expressed on a per-vcpu basis, instead than for the domain > as a whole. That is to say, each vcpu should have its own > set of preferred pcpus/nodes, instead than it being the > very same for all the vcpus of the domain; > * generalizing the idea of ''preferred pcpus'' to not only NUMA > awareness and support. That is to say, independently from > it being or not (mostly) useful on NUMA systems, it should > be possible to specify, for each vcpu, a set of pcpus where > it prefers to run (in addition, and possibly unrelated to, > the set of pcpus where it is allowed to run). > > We will be calling this set of *preferred* pcpus the vcpu''s > soft affinity, and this change introduces, allocates, frees > and initializes the data structure required to host that in > struct vcpu (cpu_soft_affinity). > > Also, the new field is not used anywhere yet, so no real > functional change yet. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>The breakdown of this in the series doesn''t make much sense to me -- I would have folded this one and patch 10 (use soft affinity instead of node affinity) together, and put it in after patch 07 (s/affinity/hard_affinity/g;). 
But the code itself is fine, and time is short, so: Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>> --- > Changes from v1: > * this patch does something similar to what, in v1, was > being done in "5/12 xen: numa-sched: make space for > per-vcpu node-affinity" > --- > xen/common/domain.c | 3 +++ > xen/common/keyhandler.c | 2 ++ > xen/common/schedule.c | 2 ++ > xen/include/xen/sched.h | 3 +++ > 4 files changed, 10 insertions(+) > > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 2cbc489..c33b876 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( > if ( !zalloc_cpumask_var(&v->cpu_affinity) || > !zalloc_cpumask_var(&v->cpu_affinity_tmp) || > !zalloc_cpumask_var(&v->cpu_affinity_saved) || > + !zalloc_cpumask_var(&v->cpu_soft_affinity) || > !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) > goto fail_free; > > @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( > free_cpumask_var(v->cpu_affinity); > free_cpumask_var(v->cpu_affinity_tmp); > free_cpumask_var(v->cpu_affinity_saved); > + free_cpumask_var(v->cpu_soft_affinity); > free_cpumask_var(v->vcpu_dirty_cpumask); > free_vcpu_struct(v); > return NULL; > @@ -737,6 +739,7 @@ static void complete_domain_destroy(struct rcu_head *head) > free_cpumask_var(v->cpu_affinity); > free_cpumask_var(v->cpu_affinity_tmp); > free_cpumask_var(v->cpu_affinity_saved); > + free_cpumask_var(v->cpu_soft_affinity); > free_cpumask_var(v->vcpu_dirty_cpumask); > free_vcpu_struct(v); > } > diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c > index 8e4b3f8..33c9a37 100644 > --- a/xen/common/keyhandler.c > +++ b/xen/common/keyhandler.c > @@ -298,6 +298,8 @@ static void dump_domains(unsigned char key) > printk("dirty_cpus=%s ", tmpstr); > cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); > printk("cpu_affinity=%s\n", tmpstr); > + cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity); > + printk("cpu_soft_affinity=%s\n", tmpstr); > printk(" 
pause_count=%d pause_flags=%lx\n", > atomic_read(&v->pause_count), v->pause_flags); > arch_dump_vcpu_info(v); > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > index 0f45f07..5731622 100644 > --- a/xen/common/schedule.c > +++ b/xen/common/schedule.c > @@ -198,6 +198,8 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) > else > cpumask_setall(v->cpu_affinity); > > + cpumask_setall(v->cpu_soft_affinity); > + > /* Initialise the per-vcpu timers. */ > init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, > v, v->processor); > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > index cbdf377..7e00caf 100644 > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -198,6 +198,9 @@ struct vcpu > /* Used to restore affinity across S3. */ > cpumask_var_t cpu_affinity_saved; > > + /* Bitmask of CPUs on which this VCPU prefers to run. */ > + cpumask_var_t cpu_soft_affinity; > + > /* Bitmask of CPUs which are holding onto this VCPU''s state. */ > cpumask_var_t vcpu_dirty_cpumask; > >
Dario Faggioli writes ("[PATCH v2 12/16] libxl: get and set soft affinity"):
> which basically means making space for a new cpumap in both
> vcpu_info (for getting soft affinity) and build_info (for setting
> it), along with providing the get/set functions, and wiring them
> to the proper xc calls. Interface is as follows:
>
>  * libxl_{get,set}_vcpuaffinity() deals with hard affinity, as it
>    always has happened;
>  * libxl_get,set}_vcpuaffinity_soft() deals with soft affinity.

In practice, doesn't this API plus these warnings mean that a toolstack which wants to migrate a domain to a new set of vcpus (or, worse, a new cpupool) will find it difficult to avoid warnings from libxl?

Because the toolstack will want to change both the hard and soft affinities to new maps, perhaps entirely disjoint from the old ones, but can only do one at a time. So the system will definitely go through "silly" states.

This would be solved with an API that permitted setting both sets of affinities in a single call, even if the underlying libxc and hypercalls are separate, because libxl would do the check only on the overall final state.

So perhaps we should have a single function that can change the cpupool, hard affinity, and soft affinity, all at once?

What if the application makes a call to change the cpupool, without touching the affinities? Should the affinities be reset automatically?

> diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
> index cf214bb..b7f39bd 100644
> --- a/tools/libxl/Makefile
> +++ b/tools/libxl/Makefile
> @@ -5,6 +5,7 @@
>  XEN_ROOT = $(CURDIR)/../..
>  include $(XEN_ROOT)/tools/Rules.mk
>
> +#MAJOR = 4.4
>  MAJOR = 4.3
>  MINOR = 0

Commented out?

> +    if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap_soft, 0)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap_soft");
> +        return NULL;
> +    }

I went and looked at the error handling. Normally in libxl allocators can't fail but this one wants to do a hypercall to look up the max cpus.

But, pulling on the thread, I wonder if it would be better to do this error logging in libxl_cpu_bitmap_alloc rather than at all the call sites. (If we did that, in principle we should introduce a new #define which allows applications to #ifdef out their own logging for this, but in practice the double-logging isn't likely to be a problem.)

And looking at libxl_cpu_bitmap_alloc, it calls libxl_get_max_cpus and does something very odd with the return value: libxl_get_max_nodes ought to return a positive number or a libxl error code. So I went to look at libxl_get_max_nodes and it just returns whatever it got from libxc, which is definitely wrong!

Would you mind fixing this as part of this series?

Thanks,
Ian.
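The return-value convention asked for here (a positive count or a library error code, with the logging done once inside the allocator) could be sketched as below. All names are hypothetical stand-ins (`toy_lowlevel_max_cpus()`, `toy_get_max_cpus()`, `toy_cpu_bitmap_alloc()`, `TOY_ERROR_FAIL`); the sketch assumes the lower layer reports failure with a non-positive value and does not reproduce the real libxl or libxc code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define TOY_ERROR_FAIL (-3) /* hypothetical library error code */

/* Hypothetical lower layer; the result is a variable so failure can be simulated. */
static int toy_lowlevel_max_cpus_result = 8;

static int toy_lowlevel_max_cpus(void)
{
    return toy_lowlevel_max_cpus_result;
}

/* Returns a positive number of cpus, or a library error code. */
static int toy_get_max_cpus(void)
{
    int r = toy_lowlevel_max_cpus();

    return r > 0 ? r : TOY_ERROR_FAIL;
}

struct toy_bitmap {
    unsigned char *map;
    size_t size; /* in bytes */
};

/* Logs its own failures, so call sites do not have to. */
static int toy_cpu_bitmap_alloc(struct toy_bitmap *bm)
{
    int max_cpus;

    assert(bm != NULL);
    bm->map = NULL;
    bm->size = 0;

    max_cpus = toy_get_max_cpus();
    if (max_cpus < 0) {
        fprintf(stderr, "toy: cannot retrieve the maximum number of cpus\n");
        return max_cpus;
    }

    bm->size = ((size_t)max_cpus + 7) / 8;
    bm->map = calloc(bm->size, 1);
    if (bm->map == NULL) {
        fprintf(stderr, "toy: cannot allocate a %zu byte cpu bitmap\n",
                bm->size);
        bm->size = 0;
        return TOY_ERROR_FAIL;
    }

    return 0;
}
```

A caller then only needs `if (toy_cpu_bitmap_alloc(&bm)) return NULL;` with no logging of its own.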
Ian Jackson
2013-Nov-14 15:12 UTC
Re: [PATCH v2 13/16] xl: show soft affinity in `xl vcpu-list'
Dario Faggioli writes ("[PATCH v2 13/16] xl: show soft affinity in `xl vcpu-list'"):
> if the '-s'/'--soft' option is provided. An example of such
> output is this:

Do we want to use "-s" for this? (What does xm do with it?)

Perhaps we should have one more general option that displays cpu affinity and cpupool (i.e., basic NUMA and scheduling info).

Ian.
Ian Jackson
2013-Nov-14 15:14 UTC
Re: [PATCH v2 15/16] xl: enable for specifying node-affinity in the config file
Dario Faggioli writes ("[PATCH v2 15/16] xl: enable for specifying node-affinity in the config file"):
> in a similar way to how it is possible to specify vcpu-affinity.

I think this config file interface is fine, but if the libxl API is going to change, the implementation here will have to change too. So I'll write:

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

If this patch gets reworked, please take my ack off but mention that I previously acked it :-).

Thanks,
Ian.
Ian Jackson
2013-Nov-14 15:17 UTC
Re: [PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
Dario Faggioli writes ("[PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity"):
> vCPU soft affinity and NUMA-aware scheduling does not have
> to be related. However, soft affinity is how NUMA-aware
> scheduling is actually implemented, and therefore, by default,
> the results of automatic NUMA placement (at VM creation time)
> are also used to set the soft affinity of all the vCPUs of
> the domain.
...
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> index a1c16b0..d241efc 100644
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -222,21 +222,39 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
>       * some weird error manifests) the subsequent call to
>       * libxl_domain_set_nodeaffinity() will do the actual placement,
>       * whatever that turns out to be.
> +     *
> +     * As far as scheduling is concerned, we achieve NUMA-aware scheduling
> +     * by having the results of placement affect the soft affinity of all
> +     * the vcpus of the domain. Of course, we want that iff placement is
> +     * enabled and actually happens, so we only change info->cpumap_soft to
> +     * reflect the placement result if that is the case
>       */
>      if (libxl_defbool_val(info->numa_placement)) {

But isn't the default for this true?

> -        if (!libxl_bitmap_is_full(&info->cpumap)) {
> +        /* We require both hard and soft affinity not to be set */
> +        if (!libxl_bitmap_is_full(&info->cpumap) ||
> +            !libxl_bitmap_is_full(&info->cpumap_soft)) {
>              LOG(ERROR, "Can run NUMA placement only if no vcpu "
> -                       "affinity is specified");
> +                       "(hard or soft) affinity is specified");
>              return ERROR_INVAL;

So the result will be that any attempt to set cpumaps in an otherwise-naive config file will cause errors, rather than just disabling the NUMA placement?

Thanks,
Ian.
George Dunlap
2013-Nov-14 15:21 UTC
Re: [PATCH v2 08/16] xen: derive NUMA node affinity from hard and soft CPU affinity
On 13/11/13 19:12, Dario Faggioli wrote:> if a domain''s NUMA node-affinity (which is what controls > memory allocations) is provided by the user/toolstack, it > just is not touched. However, if the user does not say > anything, leaving it all to Xen, let''s compute it in the > following way: > > 1. cpupool''s cpus & hard-affinity & soft-affinity > 2. if (1) is empty: cpupool''s cpus & hard-affinity > > This guarantees memory to be allocated from the narrowest > possible set of NUMA nodes, ad makes it relatively easy to > set up NUMA-aware scheduling on top of soft affinity. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > --- > xen/common/domain.c | 22 +++++++++++++++++++++- > 1 file changed, 21 insertions(+), 1 deletion(-) > > diff --git a/xen/common/domain.c b/xen/common/domain.c > index 2916490..4b8fca8 100644 > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -353,7 +353,7 @@ struct domain *domain_create( > > void domain_update_node_affinity(struct domain *d) > { > - cpumask_var_t cpumask; > + cpumask_var_t cpumask, cpumask_soft; > cpumask_var_t online_affinity; > const cpumask_t *online; > struct vcpu *v; > @@ -361,9 +361,15 @@ void domain_update_node_affinity(struct domain *d) > > if ( !zalloc_cpumask_var(&cpumask) ) > return; > + if ( !zalloc_cpumask_var(&cpumask_soft) ) > + { > + free_cpumask_var(cpumask); > + return; > + } > if ( !alloc_cpumask_var(&online_affinity) ) > { > free_cpumask_var(cpumask); > + free_cpumask_var(cpumask_soft); > return; > } > > @@ -373,8 +379,12 @@ void domain_update_node_affinity(struct domain *d) > > for_each_vcpu ( d, v ) > { > + /* Build up the mask of online pcpus we have hard affinity with */ > cpumask_and(online_affinity, v->cpu_hard_affinity, online); > cpumask_or(cpumask, cpumask, online_affinity); > + > + /* As well as the mask of all pcpus we have soft affinity with */ > + cpumask_or(cpumask_soft, cpumask_soft, v->cpu_soft_affinity);Is this really the most efficient way to do this? 
I would have thought or''ing cpumask and cpumask_soft, and then if it''s not empty, then use it; maybe use a pointer so you don''t have to copy one into the other one? Also, all of the above computation happens unconditionally, but is only used if d->auto_node_affinity is true. It seems like it would be better to move this calculation inside the conditional. Leaving the allocation outside might make sense to reduce the code covered by the lock -- not sure about that; I don''t think this is a heavily-contended lock, is it?> } > > /* > @@ -386,6 +396,15 @@ void domain_update_node_affinity(struct domain *d) > */ > if ( d->auto_node_affinity ) > { > + /* > + * We''re looking for the narower possible set of nodes. So, if > + * possible (i.e., if not empty!) let''s use the intersection > + * between online, hard and soft affinity. If not, just fall back > + * to online & hard affinity. > + */ > + if ( cpumask_intersects(cpumask, cpumask_soft) ) > + cpumask_and(cpumask, cpumask, cpumask_soft); > + > nodes_clear(d->node_affinity); > for_each_online_node ( node ) > if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) > @@ -397,6 +416,7 @@ void domain_update_node_affinity(struct domain *d) > spin_unlock(&d->node_affinity_lock);> > free_cpumask_var(online_affinity); > + free_cpumask_var(cpumask_soft); > free_cpumask_var(cpumask); > } > >
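To make the computation under discussion concrete, here is a minimal sketch of the rule from the patch description: use cpupool & hard & soft when that is non-empty, otherwise fall back to cpupool & hard, then derive the set of nodes from the result. It assumes 64-bit words in place of cpumask_t and a fixed number of pcpus per node; `toy_vcpu`, `toy_auto_affinity_mask()` and `toy_cpus_to_nodes()` are hypothetical names, not Xen functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vcpu {
    uint64_t hard; /* pcpus where the vcpu is allowed to run */
    uint64_t soft; /* pcpus where the vcpu prefers to run */
};

/*
 * Narrowest usable set of pcpus for a domain:
 *  1. (online & hard) & soft, accumulated over all vcpus, if not empty;
 *  2. otherwise just online & hard.
 */
static uint64_t toy_auto_affinity_mask(const struct toy_vcpu *vcpus, size_t n,
                                       uint64_t online)
{
    uint64_t hard_online = 0, soft = 0;

    assert(vcpus != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        hard_online |= vcpus[i].hard & online;
        soft |= vcpus[i].soft;
    }

    return (hard_online & soft) ? (hard_online & soft) : hard_online;
}

/* Assumes pcpu N lives on node N / cpus_per_node (a toy topology). */
static uint32_t toy_cpus_to_nodes(uint64_t cpus, unsigned int cpus_per_node)
{
    uint32_t nodes = 0;

    assert(cpus_per_node >= 2 && cpus_per_node <= 64);

    for (unsigned int cpu = 0; cpu < 64; cpu++)
        if (cpus & (UINT64_C(1) << cpu))
            nodes |= UINT32_C(1) << (cpu / cpus_per_node);

    return nodes;
}
```

In line with the comment about d->auto_node_affinity, a caller would only need to invoke these two helpers when the node affinity is being computed automatically.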
George Dunlap
2013-Nov-14 15:30 UTC
Re: [PATCH v2 10/16] xen: sched: use soft-affinity instead of domain's node-affinity
On 13/11/13 19:12, Dario Faggioli wrote:
> now that we have it, use soft affinity for scheduling, and replace
> the indirect use of the domain's NUMA node-affinity. This is
> more general, as soft affinity does not have to be related to NUMA.
> At the same time it allows to achieve the same results as
> NUMA-aware scheduling, just by making soft affinity equal to the
> domain's node affinity, for all the vCPUs (e.g., from the toolstack).
>
> This also means renaming most of the NUMA-aware scheduling related
> functions, in credit1, to something more generic, hinting toward
> the concept of soft affinity rather than directly to NUMA awareness.
>
> As a side effects, this simplifies the code quit a bit. In fact,
> prior to this change, we needed to cache the translation of
> d->node_affinity (which is a nodemask_t) to a cpumask_t, since that
> is what scheduling decisions require (we used to keep it in
> node_affinity_cpumask). This, and all the complicated logic
> required to keep it updated, is not necessary any longer.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is being updated too, to match
> the new architecture.
>
> signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>

Just a few things to note below...

> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 4b8fca8..b599223 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -411,8 +411,6 @@ void domain_update_node_affinity(struct domain *d)
>                  node_set(node, d->node_affinity);
>      }
>
> -    sched_set_node_affinity(d, &d->node_affinity);
> -
>      spin_unlock(&d->node_affinity_lock);

At this point, the only thing inside the spinlock is contingent on d->auto_node_affinity.

> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index 398b095..0790ebb 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
...
> -static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
>                                             const cpumask_t *mask)
>  {
> -    const struct domain *d = vc->domain;
> -    const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> -    if ( d->auto_node_affinity
> -         || cpumask_full(sdom->node_affinity_cpumask)
> -         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> +    if ( cpumask_full(vc->cpu_soft_affinity)
> +         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
>          return 0;

At this point we've lost a way to make this check potentially much faster (being able to check auto_node_affinity). This isn't a super-hot path but it does happen fairly frequently -- will the "cpumask_full()" check take a significant amount of time on, say, a 4096-core system? If so, we might think about "caching" the results of cpumask_full() at some point.
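One way the "caching" idea could look, purely as a sketch: recompute the fullness of the soft affinity mask only when the mask is set, and let the frequent check read a single flag. The names (`toy_vcpu`, `toy_set_soft_affinity()`, `toy_has_soft_affinity()`) and the fixed 4096-bit mask are assumptions for illustration, not the credit1 code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_CPUS    4096
#define TOY_MASK_WORDS (TOY_NR_CPUS / 64)

struct toy_vcpu {
    uint64_t soft[TOY_MASK_WORDS];
    bool soft_is_full; /* cached result of the "is the mask full?" scan */
};

/* Full scan: 64 words for 4096 cpus. */
static bool toy_mask_full(const uint64_t *m)
{
    for (size_t i = 0; i < TOY_MASK_WORDS; i++)
        if (m[i] != UINT64_MAX)
            return false;
    return true;
}

static bool toy_mask_intersects(const uint64_t *a, const uint64_t *b)
{
    for (size_t i = 0; i < TOY_MASK_WORDS; i++)
        if (a[i] & b[i])
            return true;
    return false;
}

/* Slow path (rare): update the mask and refresh the cached flag. */
static void toy_set_soft_affinity(struct toy_vcpu *v, const uint64_t *mask)
{
    assert(v != NULL && mask != NULL);
    memcpy(v->soft, mask, sizeof(v->soft));
    v->soft_is_full = toy_mask_full(mask);
}

/* Fast path (frequent): a full mask is rejected by reading one flag. */
static bool toy_has_soft_affinity(const struct toy_vcpu *v,
                                  const uint64_t *mask)
{
    assert(v != NULL && mask != NULL);
    if (v->soft_is_full)
        return false;
    return toy_mask_intersects(v->soft, mask);
}
```

The cached flag is only valid if every writer of the mask goes through the setter, which is the price of avoiding the scan on each scheduling decision.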
George Dunlap
2013-Nov-14 15:38 UTC
Re: [PATCH v2 11/16] libxc: get and set soft and hard affinity
On 13/11/13 19:12, Dario Faggioli wrote:
> by using the new flag introduced in the parameters of
> DOMCTL_{get,set}_vcpuaffinity.
>
> This happens in two new xc calls: xc_vcpu_setaffinity_hard()
> and xc_vcpu_setaffinity_soft() (an in the corresponding
> getters, of course).

Personally I think:

* You should be able to set both HARD and SOFT flags, in which case it will set both hard and soft affinities
* You should expose the "flags" to the xc caller, so that the caller can set either one or both.

>
> The existing xc_vcpu_{set,get}affinity() call is also retained,
> with the following behavior:
>
>  * xc_vcpu_setaffinity() sets both the hard and soft affinity;
>  * xc_vcpu_getaffinity() gets the hard affinity.
>
> This is mainly for backward compatibility reasons, i.e., trying
> not to break existing callers/users.

The xc interface isn't stable, right? Couldn't we just change the callers (presumably just xend and libxl)?

 -George
On 14/11/13 15:11, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v2 12/16] libxl: get and set soft affinity"):
>> which basically means making space for a new cpumap in both
>> vcpu_info (for getting soft affinity) and build_info (for setting
>> it), along with providing the get/set functions, and wiring them
>> to the proper xc calls. Interface is as follows:
>>
>>  * libxl_{get,set}_vcpuaffinity() deals with hard affinity, as it
>>    always has happened;
>>  * libxl_get,set}_vcpuaffinity_soft() deals with soft affinity.
> In practice, doesn't this API plus these warnings mean that a
> toolstack which wants to migrate a domain to a new set of vcpus (or,
> worse, a new cpupool) will find it difficult to avoid warnings from
> libxl ?
>
> Because the toolstack will want to change both the hard and soft
> affinities to new maps, perhaps entirely disjoint from the old ones,
> but can only do one at a time. So the system will definitely go
> through "silly" states.
>
> This would be solved with an API that permitted setting both sets of
> affinities in a single call, even if the underlying libxc and
> hypercalls are separate, because libxl would do the check only on the
> overall final state.
>
> So perhaps we should have a singe function that can change the
> cpupool, hard affinity, and soft affinity, all at once ?

I think this is probably a good idea. Would it make sense to basically have libxl_[gs]et_vcpuaffinity2(), which takes a parameter that can specify that the mask is for either hard, soft, or both?

>
> What if the application makes a call to change the cpupool, without
> touching the affinities ? Should the affinities be reset
> automatically ?

I think whatever happens for hard affinities right now should be carried on.

 -George
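The "silly intermediate state" problem, and how a combined call avoids it, can be shown with a toy model. This is a sketch under assumptions: 64-bit masks, a warning counter in place of libxl's logging, and hypothetical functions `toy_set_hard()`, `toy_set_soft()` and `toy_set_both()`. It takes two masks, which is only one possible shape for the combined interface discussed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vcpu {
    uint64_t hard;
    uint64_t soft;
    int warnings; /* stands in for the warnings the library would log */
};

/* Warn when the soft affinity cannot have any effect. */
static void toy_check(struct toy_vcpu *v)
{
    if ((v->hard & v->soft) == 0)
        v->warnings++;
}

/* Separate setters: each one checks the state it leaves behind. */
static void toy_set_hard(struct toy_vcpu *v, uint64_t mask)
{
    assert(v != NULL);
    v->hard = mask;
    toy_check(v);
}

static void toy_set_soft(struct toy_vcpu *v, uint64_t mask)
{
    assert(v != NULL);
    v->soft = mask;
    toy_check(v);
}

/* Combined setter: apply both, then check only the final state. */
static void toy_set_both(struct toy_vcpu *v, uint64_t hard, uint64_t soft)
{
    assert(v != NULL);
    v->hard = hard;
    v->soft = soft;
    toy_check(v);
}
```

Moving a vcpu from pcpus 0-3 to pcpus 4-7 with the two separate setters passes through a state where hard and soft are disjoint and triggers a warning; the combined call reaches the same final state without one.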
George Dunlap
2013-Nov-14 16:03 UTC
Re: [PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
On 13/11/13 19:13, Dario Faggioli wrote:> vCPU soft affinity and NUMA-aware scheduling does not have > to be related. However, soft affinity is how NUMA-aware > scheduling is actually implemented, and therefore, by default, > the results of automatic NUMA placement (at VM creation time) > are also used to set the soft affinity of all the vCPUs of > the domain. > > Of course, this only happens if automatic NUMA placement is > enabled and actually takes place (for instance, if the user > does not specify any hard and soft affiniy in the xl config > file). > > This also takes care of the vice-versa, i.e., don''t trigger > automatic placement if the config file specifies either an > hard (the check for which was already there) or a soft (the > check for which is introduced by this commit) affinity.It looks like with this patch you set *both* hard and soft affinities when doing auto-numa placement. Would it make more sense to change it to setting only the soft affinity, and leaving the hard affinity to "any"? (My brain is running low, so forgive me if I''ve mis-read it...) -George> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > --- > docs/man/xl.cfg.pod.5 | 21 +++++++++++---------- > docs/misc/xl-numa-placement.markdown | 16 ++++++++++++++-- > tools/libxl/libxl_dom.c | 22 ++++++++++++++++++++-- > 3 files changed, 45 insertions(+), 14 deletions(-) > > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 > index 733c74e..d4a0a6f 100644 > --- a/docs/man/xl.cfg.pod.5 > +++ b/docs/man/xl.cfg.pod.5 > @@ -150,16 +150,6 @@ here, and the soft affinity mask, provided via B<cpus\_soft=> (if any), > is utilized to compute the domain node-affinity, for driving memory > allocations. > > -If we are on a NUMA machine (i.e., if the host has more than one NUMA > -node) and this option is not specified, libxl automatically tries to > -place the guest on the least possible number of nodes. 
That, however, > -will not affect vcpu pinning, so the guest will still be able to run on > -all the cpus. A heuristic approach is used for choosing the best node (or > -set of nodes), with the goals of maximizing performance for the guest > -and, at the same time, achieving efficient utilization of host cpus > -and memory. See F<docs/misc/xl-numa-placement.markdown> for more > -details. > - > =item B<cpus_soft="CPU-LIST"> > > Exactly as B<cpus=>, but specifies soft affinity, rather than pinning > @@ -178,6 +168,17 @@ the intersection of the soft affinity mask, provided here, and the vcpu > pinning, provided via B<cpus=> (if any), is utilized to compute the > domain node-affinity, for driving memory allocations. > > +If this option is not specified (and B<cpus=> is not specified either), > +libxl automatically tries to place the guest on the least possible > +number of nodes. A heuristic approach is used for choosing the best > +node (or set of nodes), with the goal of maximizing performance for > +the guest and, at the same time, achieving efficient utilization of > +host cpus and memory. In that case, the soft affinity of all the vcpus > +of the domain will be set to the pcpus belonging to the NUMA nodes > +chosen during placement. > + > +For more details, see F<docs/misc/xl-numa-placement.markdown>. > + > =back > > =head3 CPU Scheduling > diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown > index b1ed361..f644758 100644 > --- a/docs/misc/xl-numa-placement.markdown > +++ b/docs/misc/xl-numa-placement.markdown > @@ -126,10 +126,22 @@ or Xen won''t be able to guarantee the locality for their memory accesses. > That, of course, also mean the vCPUs of the domain will only be able to > execute on those same pCPUs. > > +Starting from 4.4, is is also possible to specify a "cpus\_soft=" option > +in the xl config file. 
This, independently from whether or not "cpus=" is > +specified too, affect the NUMA placement in a way very similar to what > +is described above. In fact, the hypervisor will build up the node-affinity > +of the VM basing right on it or, if both pinning (via "cpus=") and soft > +affinity (via "cpus\_soft=") are present, basing on their intersection. > + > +Besides that, "cpus\_soft=" also means, of course, that the vCPUs of the > +domain will prefer to execute on, among the pCPUs where they can run, > +those particular pCPUs. > + > + > ### Placing the guest automatically ### > > -If no "cpus=" option is specified in the config file, libxl tries > -to figure out on its own on which node(s) the domain could fit best. > +If neither "cpus=" nor "cpus\_soft=" are present in the config file, libxl > +tries to figure out on its own on which node(s) the domain could fit best. > If it finds one (some), the domain''s node affinity get set to there, > and both memory allocations and NUMA aware scheduling (for the credit > scheduler and starting from Xen 4.3) will comply with it. Starting from > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > index a1c16b0..d241efc 100644 > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -222,21 +222,39 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, > * some weird error manifests) the subsequent call to > * libxl_domain_set_nodeaffinity() will do the actual placement, > * whatever that turns out to be. > + * > + * As far as scheduling is concerned, we achieve NUMA-aware scheduling > + * by having the results of placement affect the soft affinity of all > + * the vcpus of the domain. 
Of course, we want that iff placement is > + * enabled and actually happens, so we only change info->cpumap_soft to > + * reflect the placement result if that is the case > */ > if (libxl_defbool_val(info->numa_placement)) { > > - if (!libxl_bitmap_is_full(&info->cpumap)) { > + /* We require both hard and soft affinity not to be set */ > + if (!libxl_bitmap_is_full(&info->cpumap) || > + !libxl_bitmap_is_full(&info->cpumap_soft)) { > LOG(ERROR, "Can run NUMA placement only if no vcpu " > - "affinity is specified"); > + "(hard or soft) affinity is specified"); > return ERROR_INVAL; > } > > rc = numa_place_domain(gc, domid, info); > if (rc) > return rc; > + > + /* > + * We change the soft affinity in domain_build_info here, of course > + * after converting the result of placement from nodes to cpus. the > + * following call to libxl_set_vcpuaffinity_all_soft() will do the > + * actual updating of the domain''s vcpus'' soft affinity. > + */ > + libxl_nodemap_to_cpumap(ctx, &info->nodemap, &info->cpumap_soft); > } > libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > + libxl_set_vcpuaffinity_all_soft(ctx, domid, info->max_vcpus, > + &info->cpumap_soft); > > xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); > xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL); >
Dario Faggioli
2013-Nov-14 16:11 UTC
Re: [PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
On Thu, 2013-11-14 at 15:17 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity"):
> > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> > index a1c16b0..d241efc 100644
> > --- a/tools/libxl/libxl_dom.c
> > +++ b/tools/libxl/libxl_dom.c
> > @@ -222,21 +222,39 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
> >       * some weird error manifests) the subsequent call to
> >       * libxl_domain_set_nodeaffinity() will do the actual placement,
> >       * whatever that turns out to be.
> > +     *
> > +     * As far as scheduling is concerned, we achieve NUMA-aware scheduling
> > +     * by having the results of placement affect the soft affinity of all
> > +     * the vcpus of the domain. Of course, we want that iff placement is
> > +     * enabled and actually happens, so we only change info->cpumap_soft to
> > +     * reflect the placement result if that is the case
> >       */
> >      if (libxl_defbool_val(info->numa_placement)) {
>
> But isn't the default for this true ?
>
It is.

> > -        if (!libxl_bitmap_is_full(&info->cpumap)) {
> > +        /* We require both hard and soft affinity not to be set */
> > +        if (!libxl_bitmap_is_full(&info->cpumap) ||
> > +            !libxl_bitmap_is_full(&info->cpumap_soft)) {
> >              LOG(ERROR, "Can run NUMA placement only if no vcpu "
> > -                       "affinity is specified");
> > +                       "(hard or soft) affinity is specified");
> >              return ERROR_INVAL;
>
> So the result will be that any attempt to set cpumaps in an
> otherwise-naive config file will cause errors, rather than just
> disabling the numa placement ?
>
Nope, because, as discussed already not more than a couple of days
back, what xl does when finding a "cpus=" option (and, from this patch
on, a "cpus_soft=" option) is:
 1. set numa_placement to false
 2. set cpumap (or cpumap_soft) as requested.

:-)

So, as far as xl is concerned, this is fine. It is true that every other
consumer of libxl needs to do the same (i.e., set numa_placement to
false), or it will hit the error, and that may or may not be obvious.

For sure, they'll figure it out if they check out xl (which is meant to
serve as a 'reference implementation' too, right?), and it is all
documented, at least in docs/misc (I can double check whether it is also
stressed enough somewhere in libxl.h, and put it there if not).

For one, I should probably improve the comment above the if(), to keep
this sort of confusion from happening again.

Anyway, like it or not, this is the kind of interface we came up with
when designing, refining, and checking in that bit:
http://lists.xen.org/archives/html/xen-devel/2012-07/msg01077.html

If we don't like it, I think there is some room for amending it since,
despite the libxl API being stable, this looks enough like an
implementation detail to me.... It's just a matter of agreeing on what we
actually want. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
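The contract described in this reply (clear numa_placement before requesting any affinity, or the check in libxl__build_pre() fails) can be sketched as below. This is only an illustration under stated assumptions: `struct build_info`, `request_hard_affinity()`, `request_soft_affinity()` and `build_pre_check()` are hypothetical stand-ins, not libxl types or functions, and a 64-bit word replaces libxl_bitmap.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for libxl_domain_build_info. */
struct build_info {
    bool numa_placement;   /* true by default, as confirmed in the thread */
    uint64_t cpumap;       /* hard affinity, one bit per pcpu */
    uint64_t cpumap_soft;  /* soft affinity, one bit per pcpu */
};

#define ALL_CPUS  UINT64_MAX   /* models "no affinity specified" */
#define ERR_INVAL (-1)         /* stand-in for ERROR_INVAL */

static void build_info_init(struct build_info *info)
{
    info->numa_placement = true;
    info->cpumap = ALL_CPUS;
    info->cpumap_soft = ALL_CPUS;
}

/* What xl is said to do for "cpus=": disable placement, then set the map. */
static void request_hard_affinity(struct build_info *info, uint64_t mask)
{
    info->numa_placement = false;
    info->cpumap = mask;
}

/* Same two steps for "cpus_soft=". */
static void request_soft_affinity(struct build_info *info, uint64_t mask)
{
    info->numa_placement = false;
    info->cpumap_soft = mask;
}

/* Mirrors the check quoted from libxl__build_pre(): placement may only
 * run when neither hard nor soft affinity has been specified. */
static int build_pre_check(const struct build_info *info)
{
    if (info->numa_placement &&
        (info->cpumap != ALL_CPUS || info->cpumap_soft != ALL_CPUS))
        return ERR_INVAL;
    return 0;
}
```

A caller that writes cpumap_soft directly, without clearing numa_placement, is the case that would hit the error; going through the two-step helpers avoids it.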
Dario Faggioli
2013-Nov-14 16:12 UTC
Re: [PATCH v2 15/16] xl: enable for specifying node-affinity in the config file
On Thu, 2013-11-14 at 15:14 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v2 15/16] xl: enable for specifying node-affinity in the config file"):
> > in a similar way to how it is possible to specify vcpu-affinity.
>
> I think this config file interface is fine, but if the libxl API is
> going to change the implementation here will have to too.
>
> So I'll write:
>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
>
Thanks.

> If this patch gets reworked, please take my ack off but mention that I
> previously acked it :-).
>
Ok, will do.

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-14 16:14 UTC
Re: [PATCH v2 06/16] xen: sched: make space for cpu_soft_affinity
On Thu, 2013-11-14 at 15:03 +0000, George Dunlap wrote:
> The breakdown of this in the series doesn't make much sense to me -- I
> would have folded this one and patch 10 (use soft affinity instead of
> node affinity) together, and put it in after patch 07
> (s/affinity/hard_affinity/g;).
>
I see. Well, it looks like I'm respinning the series anyway. While at
it, I'll see if I can arrange it that way. If I do (and provided that
does not imply any code change in those patches), can I retain your
Reviewed-by tag in the result?

> But the code itself is fine, and time is short, so:
>
> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
>
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-14 16:18 UTC
Re: [PATCH v2 11/16] libxc: get and set soft and hard affinity
On Thu, 2013-11-14 at 14:58 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v2 11/16] libxc: get and set soft and hard affinity"):
> > @@ -206,39 +209,119 @@ int xc_vcpu_setaffinity(xc_interface *xch,
> >          goto out;
> >      }
> >
> > -    local = xc_hypercall_buffer_alloc(xch, local, cpusize);
> > -    if ( local == NULL )
> > +    cpumap_local = xc_hypercall_buffer_alloc(xch, cpumap_local, cpusize);
>
> Is xc_hypercall_buffer_free idempotent, and is there a way to init a
> hypercall buffer to an unallocated state ? If so this function could
> be a lot simpler, and it could in particular more clearly not leak
> anything, by using the "goto out" cleanup style.
>
When I put this together, I followed suit from similar cases (actually,
this is mostly renaming the function and adding a new parameter). That
is to say, I don't think it can be done much differently, but yes, I
will check.

> > +/**
> > + * This functions specify the scheduling affinity for a vcpu. Soft
> > + * affinity is on what pcpus a vcpu prefers to run. Hard affinity is
> > + * on what pcpus a vcpu is allowed to run. When set independently (by
> > + * the respective _soft and _hard calls) the effective affinity is
> > + * also returned. What we call the effective affinity it the intersection
> > + * of soft affinity, hard affinity and the set of the cpus of the cpupool
> > + * the domain belongs to. It's basically what the Xen scheduler will
> > + * actually use. Returning it to the caller allows him to check if that
> > + * matches with, or at least is good enough for, his purposes.
> > + *
> > + * A xc_vcpu_setaffinity() call is provided, mainly for backward
> > + * compatibility reasons, and what it does is setting both hard and
> > + * soft affinity for the vcpu.
> > + *
> > + * @param xch a handle to an open hypervisor interface.
> > + * @param domid the id of the domain to which the vcpu belongs
> > + * @param vcpu the vcpu id wihin the domain
> > + * @param cpumap the (hard, soft, both) new affinity map one wants to set
> > + * @param ecpumap the effective affinity for the vcpu
>
> Either the doc comment, or the parameter name, should make it clear
> that ecpumap is an out parameter.
>
Ok, will do.

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-14 16:21 UTC
Re: [PATCH v2 09/16] xen: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On Thu, 2013-11-14 at 14:42 +0000, George Dunlap wrote:
> On 13/11/13 19:12, Dario Faggioli wrote:
> > @@ -617,19 +617,49 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> >          if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
> >          {
> >              cpumask_var_t new_affinity;
> > +            cpumask_t *online;
> >
> >              ret = xenctl_bitmap_to_cpumask(
> >                  &new_affinity, &op->u.vcpuaffinity.cpumap);
> > -            if ( !ret )
> > +            if ( ret )
> > +                break;
> > +
> > +            ret = -EINVAL;
> > +            if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD )
> > +                ret = vcpu_set_hard_affinity(v, new_affinity);
> > +            else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT )
> > +                ret = vcpu_set_soft_affinity(v, new_affinity);
>
> Wait, why are we making this a bitmask of flags, if we can only set one
> at a time? Shouldn't it instead be a simple enum?
>
Right. I was in two minds about which of the two ways you outline here
to go for and, apparently, ended up with the worst possible combination,
i.e., something in the middle! :-P

> Or alternately (which is what I was expecting when I saw it was
> 'flags'), shouldn't it allow you to set both at the same time? (i.e.,
> take away the 'else' here?)
>
Indeed. I'll take this path.

> >          else
> >          {
> > -            ret = cpumask_to_xenctl_bitmap(
> > -                &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
> > +            if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_HARD )
> > +                ret = cpumask_to_xenctl_bitmap(
> > +                    &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
> > +            else if ( op->u.vcpuaffinity.flags & XEN_VCPUAFFINITY_SOFT )
> > +                ret = cpumask_to_xenctl_bitmap(
> > +                    &op->u.vcpuaffinity.cpumap, v->cpu_soft_affinity);
> > +            else
> > +                ret = -EINVAL;
>
> Asking for both the hard and soft affinities at the same time shouldn't
> return just the hard affinity. It should either return an error, or do
> something interesting like return the intersection of the two.
>
Right again. Given what we said above, I think I'll go for the
intersection.

> Other than that, I think this looks good.
>
Cool, I'll respin with these changes ASAP.

Thanks and Regards,
Dario
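The semantics agreed on in this exchange (the flags are a real bitmask, setting with both flags updates both masks, and getting with both flags returns the intersection) can be sketched as follows. This is a toy model, not the do_domctl() code: the flag values, `struct vcpu_aff` and both helpers are assumptions made for illustration, and a 64-bit word stands in for cpumask_t.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values; only the names echo XEN_VCPUAFFINITY_*. */
#define VCPUAFFINITY_HARD (1u << 0)
#define VCPUAFFINITY_SOFT (1u << 1)
#define VCPUAFFINITY_BOTH (VCPUAFFINITY_HARD | VCPUAFFINITY_SOFT)

/* Toy per-vcpu state: one bit per pcpu. */
struct vcpu_aff {
    uint64_t hard;
    uint64_t soft;
};

/* Set: no 'else' between the two tests, so both flags can be honoured. */
static int set_vcpu_affinity(struct vcpu_aff *v, unsigned int flags,
                             uint64_t mask)
{
    if (!(flags & VCPUAFFINITY_BOTH))
        return -EINVAL;
    if (flags & VCPUAFFINITY_HARD)
        v->hard = mask;
    if (flags & VCPUAFFINITY_SOFT)
        v->soft = mask;
    return 0;
}

/* Get: asking for both returns the intersection, not just the hard mask. */
static int get_vcpu_affinity(const struct vcpu_aff *v, unsigned int flags,
                             uint64_t *out)
{
    switch (flags & VCPUAFFINITY_BOTH) {
    case VCPUAFFINITY_HARD:
        *out = v->hard;
        return 0;
    case VCPUAFFINITY_SOFT:
        *out = v->soft;
        return 0;
    case VCPUAFFINITY_BOTH:
        *out = v->hard & v->soft;
        return 0;
    default:
        return -EINVAL;
    }
}
```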
George Dunlap writes ("Re: [PATCH v2 12/16] libxl: get and set soft affinity"):
> On 14/11/13 15:11, Ian Jackson wrote:
> > So perhaps we should have a single function that can change the
> > cpupool, hard affinity, and soft affinity, all at once ?
>
> I think this is probably a good idea. Would it make sense to basically
> have libxl_[gs]et_vcpuaffinity2(), which takes a parameter that can
> specify that the mask is for either hard, soft, or both?

No, it needs to take at least two separate parameters for the hard and
soft masks, because the toolstack might want to set the hard affinity
and soft affinity to different values, without having to think carefully
about ordering these calls to avoid generating spurious warnings and
incompatible arrangements.

I think this probably applies to cpupool changes too. So perhaps the
function should look a bit like setresuid or something: it should take
an optional new cpupool (is there a sentinel value that can be used to
mean "no change"?), an optional new soft mask, and an optional new hard
mask.

> > What if the application makes a call to change the cpupool, without
> > touching the affinities ? Should the affinities be reset
> > automatically ?
>
> I think whatever happens for hard affinities right now should be carried on.

Maybe it is a bug that it doesn't do anything. I think it depends how
we expect people to use this. If a caller sets the hard affinities and
then changes the cpupool, are they supposed to always then set the hard
affinities again to a new suitable value ?

Ian.
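A setresuid-like call of the kind floated above might look like the sketch below. Everything in it is hypothetical: `POOL_NO_CHANGE` is an invented sentinel (the thread leaves open whether a suitable one exists), NULL means "leave this mask alone", the pool table is made up, and checking the final hard mask against the final pool is just one possible reading of "incompatible arrangements". The point is only that the combined state is validated once and then applied, so the caller never has to order three separate calls.

```c
#include <stddef.h>
#include <stdint.h>

#define POOL_NO_CHANGE (-1)   /* hypothetical sentinel: keep current pool */
#define NR_POOLS 2

/* Hypothetical pool layout: pool 0 owns pcpus 0-3, pool 1 owns pcpus 4-7. */
static const uint64_t pool_cpus[NR_POOLS] = { 0x0f, 0xf0 };

struct vcpu_state {
    int poolid;        /* assumed to be a valid index on entry */
    uint64_t hard;
    uint64_t soft;
};

/* Each argument is optional: POOL_NO_CHANGE / NULL means "no change". */
static int set_placement(struct vcpu_state *v, int new_pool,
                         const uint64_t *new_hard, const uint64_t *new_soft)
{
    struct vcpu_state next = *v;

    if (new_pool != POOL_NO_CHANGE) {
        if (new_pool < 0 || new_pool >= NR_POOLS)
            return -1;
        next.poolid = new_pool;
    }
    if (new_hard != NULL)
        next.hard = *new_hard;
    if (new_soft != NULL)
        next.soft = *new_soft;

    /* Validate the final combination once, whatever subset was changed. */
    if ((next.hard & pool_cpus[next.poolid]) == 0)
        return -1;

    *v = next;   /* commit everything together, or nothing at all */
    return 0;
}
```

With this shape, moving a vcpu to another pool and re-pinning it is a single call, while a pool change that would leave the hard affinity with no usable pcpu is rejected without modifying anything.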
Dario Faggioli
2013-Nov-14 16:30 UTC
Re: [PATCH v2 08/16] xen: derive NUMA node affinity from hard and soft CPU affinity
On Thu, 2013-11-14 at 15:21 +0000, George Dunlap wrote:
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > @@ -373,8 +379,12 @@ void domain_update_node_affinity(struct domain *d)
> >
> >      for_each_vcpu ( d, v )
> >      {
> > +        /* Build up the mask of online pcpus we have hard affinity with */
> >          cpumask_and(online_affinity, v->cpu_hard_affinity, online);
> >          cpumask_or(cpumask, cpumask, online_affinity);
> > +
> > +        /* As well as the mask of all pcpus we have soft affinity with */
> > +        cpumask_or(cpumask_soft, cpumask_soft, v->cpu_soft_affinity);
>
> Is this really the most efficient way to do this? I would have thought
> or'ing cpumask and cpumask_soft, and then if it's not empty, then use
> it; maybe use a pointer so you don't have to copy one into the other one?
>
I'm not sure I fully get what you mean... I cannot afford to neglect
online_affinity, regardless of what cpumask and cpumask_soft look like,
because that's what is needed to account for cpupools.

Anyway, I'll think more about it and see if I can make it better.

> Also, all of the above computation happens unconditionally, but is only
> used if d->auto_node_affinity is true. It seems like it would be better
> to move this calculation inside the conditional.
>
That is true, and I thought so too. I didn't do it here because it
seemed an unrelated change (the calculation of
cpumask |= hard_affinity & online was outside of the if already).
Anyway, again, I agree with this, so I'll either do it here or in a
pre-patch.

> Leaving the allocation outside might make sense to reduce the code
> covered by the lock -- not sure about that; I don't think this is a
> heavily-contended lock, is it?
>
Not really, nor is this a hot path, but I still think it's worth doing
what you suggest.

Thanks and Regards,
Dario
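To make the computation under discussion concrete, here is a small model of it with the mask accumulation moved inside the auto_node_affinity conditional, as suggested in the review. It is a sketch under assumptions, not domain_update_node_affinity(): masks are 64-bit words, `CPUS_PER_NODE` is an invented topology, and the "use the intersection when hard and soft overlap, otherwise fall back to the hard mask" rule is my reading of the documentation text quoted earlier in the thread rather than the patch's exact logic.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPUS_PER_NODE 4   /* hypothetical topology: 4 pcpus per NUMA node */

struct vcpu_masks {
    uint64_t hard;   /* one bit per pcpu */
    uint64_t soft;
};

/* Map a pcpu mask to the mask of nodes those pcpus belong to. */
static uint32_t cpus_to_nodes(uint64_t cpus)
{
    uint32_t nodes = 0;

    for (int cpu = 0; cpu < 64; cpu++)
        if (cpus & (UINT64_C(1) << cpu))
            nodes |= UINT32_C(1) << (cpu / CPUS_PER_NODE);
    return nodes;
}

static uint32_t update_node_affinity(const struct vcpu_masks *vcpus, int nr,
                                     uint64_t online, bool auto_node_affinity,
                                     uint32_t current_nodes)
{
    uint64_t hard = 0, soft = 0;

    /* Nothing below is needed when the node affinity was set explicitly. */
    if (!auto_node_affinity)
        return current_nodes;

    for (int i = 0; i < nr; i++) {
        hard |= vcpus[i].hard & online;  /* 'online' accounts for cpupools */
        soft |= vcpus[i].soft;
    }

    /* Assumed rule: narrow to the soft mask only where it overlaps. */
    if (hard & soft)
        hard &= soft;

    return cpus_to_nodes(hard);
}
```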
Dario Faggioli
2013-Nov-14 16:41 UTC
Re: [PATCH v2 11/16] libxc: get and set soft and hard affinity
On Thu, 2013-11-14 at 15:38 +0000, George Dunlap wrote:
> On 13/11/13 19:12, Dario Faggioli wrote:
> > by using the new flag introduced in the parameters of
> > DOMCTL_{get,set}_vcpuaffinity.
> >
> > This happens in two new xc calls: xc_vcpu_setaffinity_hard()
> > and xc_vcpu_setaffinity_soft() (and in the corresponding
> > getters, of course).
>
> Personally I think:
> * You should be able to set both HARD and SOFT flags, in which case it
> will set both hard and soft affinities
>
Ok. Just to be sure, you mean that, if I specify both flags, both hard
and soft affinity will be set to the same cpumap, right?

> * You should expose the "flags" to the xc caller, so that the caller can
> set either one or both.
>
Ok.

> > The existing xc_vcpu_{set,get}affinity() call is also retained,
> > with the following behavior:
> >
> > * xc_vcpu_setaffinity() sets both the hard and soft affinity;
> > * xc_vcpu_getaffinity() gets the hard affinity.
> >
> > This is mainly for backward compatibility reasons, i.e., trying
> > not to break existing callers/users.
>
> The xc interface isn't stable, right? Couldn't we just change the
> callers (presumably just xend and libxl)?
>
In tree, yes, those ones plus something in
tools/ocaml/libs/xc/xenctrl_stubs.c.

I'll do it that way.

Dario
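A caller-facing setter that exposes the flags, as proposed above, and reports the effective affinity through an out parameter (the intersection of hard affinity, soft affinity and the cpupool's cpus, per the doc comment quoted earlier in this thread) could be modelled like this. The names `AFF_HARD`, `AFF_SOFT`, `struct vcpu_model` and `model_vcpu_setaffinity()` are hypothetical, this is not the libxc signature, and the model assumes that setting both flags writes the same cpumap to both masks, which is exactly the point Dario asks George to confirm.

```c
#include <stdint.h>

#define AFF_HARD (1u << 0)   /* hypothetical flag values */
#define AFF_SOFT (1u << 1)

/* Toy vcpu: one bit per pcpu; pool_cpus is the cpupool it lives in. */
struct vcpu_model {
    uint64_t hard;
    uint64_t soft;
    uint64_t pool_cpus;
};

/*
 * cpumap      IN:  new mask for whichever affinities 'flags' selects
 * ecpumap_out OUT: effective affinity after the update (may be NULL)
 */
static int model_vcpu_setaffinity(struct vcpu_model *v, uint64_t cpumap,
                                  unsigned int flags, uint64_t *ecpumap_out)
{
    if (!(flags & (AFF_HARD | AFF_SOFT)))
        return -1;

    if (flags & AFF_HARD)
        v->hard = cpumap;
    if (flags & AFF_SOFT)
        v->soft = cpumap;

    /* Effective affinity: what the scheduler would actually honour. */
    if (ecpumap_out)
        *ecpumap_out = v->hard & v->soft & v->pool_cpus;
    return 0;
}
```

Returning the effective mask lets the caller notice, for instance, that a soft mask lying entirely outside the hard mask leaves nothing to prefer.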
Dario Faggioli
2013-Nov-14 16:48 UTC
Re: [PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
On gio, 2013-11-14 at 16:03 +0000, George Dunlap wrote:> On 13/11/13 19:13, Dario Faggioli wrote: > > vCPU soft affinity and NUMA-aware scheduling does not have > > to be related. However, soft affinity is how NUMA-aware > > scheduling is actually implemented, and therefore, by default, > > the results of automatic NUMA placement (at VM creation time) > > are also used to set the soft affinity of all the vCPUs of > > the domain. > > > > Of course, this only happens if automatic NUMA placement is > > enabled and actually takes place (for instance, if the user > > does not specify any hard and soft affiniy in the xl config > > file). > > > > This also takes care of the vice-versa, i.e., don''t trigger > > automatic placement if the config file specifies either an > > hard (the check for which was already there) or a soft (the > > check for which is introduced by this commit) affinity. > > It looks like with this patch you set *both* hard and soft affinities > when doing auto-numa placement. Would it make more sense to change it > to setting only the soft affinity, and leaving the hard affinity to "any"? >Nope, it indeed sets only soft affinity after automatic placement, hard affinity is left untouched.> (My brain is running low, so forgive me if I''ve mis-read it...) >:-) This is the spot:> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>> > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > > @@ -222,21 +222,39 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, > > * some weird error manifests) the subsequent call to > > * libxl_domain_set_nodeaffinity() will do the actual placement, > > * whatever that turns out to be. > > + * > > + * As far as scheduling is concerned, we achieve NUMA-aware scheduling > > + * by having the results of placement affect the soft affinity of all > > + * the vcpus of the domain. 
Of course, we want that iff placement is > > + * enabled and actually happens, so we only change info->cpumap_soft to > > + * reflect the placement result if that is the case > > */ > > if (libxl_defbool_val(info->numa_placement)) { > > > > - if (!libxl_bitmap_is_full(&info->cpumap)) { > > + /* We require both hard and soft affinity not to be set */ > > + if (!libxl_bitmap_is_full(&info->cpumap) || > > + !libxl_bitmap_is_full(&info->cpumap_soft)) { > > LOG(ERROR, "Can run NUMA placement only if no vcpu " > > - "affinity is specified"); > > + "(hard or soft) affinity is specified"); > > return ERROR_INVAL; > > } > > > > rc = numa_place_domain(gc, domid, info); > > if (rc) > > return rc; > > + > > + /* > > + * We change the soft affinity in domain_build_info here, of course > > + * after converting the result of placement from nodes to cpus. the > > + * following call to libxl_set_vcpuaffinity_all_soft() will do the > > + * actual updating of the domain''s vcpus'' soft affinity. > > + */ > > + libxl_nodemap_to_cpumap(ctx, &info->nodemap, &info->cpumap_soft); > ^| Here: -----------------------------------------------------------/ I only copy the result of placement into info->cpumap_soft, without touching info->cpumap, which is "all" (or we won''t be at this point) and stays that way.> > } > > libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); > > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > > + libxl_set_vcpuaffinity_all_soft(ctx, domid, info->max_vcpus, > > + &info->cpumap_soft); > >Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
George Dunlap
2013-Nov-14 17:49 UTC
Re: [PATCH v2 16/16] libxl: automatic NUMA placement affects soft affinity
On 11/14/2013 04:48 PM, Dario Faggioli wrote:
> On Thu, 2013-11-14 at 16:03 +0000, George Dunlap wrote:
>> On 13/11/13 19:13, Dario Faggioli wrote:
>>> vCPU soft affinity and NUMA-aware scheduling does not have
>>> to be related. However, soft affinity is how NUMA-aware
>>> scheduling is actually implemented, and therefore, by default,
>>> the results of automatic NUMA placement (at VM creation time)
>>> are also used to set the soft affinity of all the vCPUs of
>>> the domain.
>>>
>>> Of course, this only happens if automatic NUMA placement is
>>> enabled and actually takes place (for instance, if the user
>>> does not specify any hard and soft affinity in the xl config
>>> file).
>>>
>>> This also takes care of the vice-versa, i.e., don't trigger
>>> automatic placement if the config file specifies either an
>>> hard (the check for which was already there) or a soft (the
>>> check for which is introduced by this commit) affinity.
>>
>> It looks like with this patch you set *both* hard and soft affinities
>> when doing auto-numa placement. Would it make more sense to change it
>> to setting only the soft affinity, and leaving the hard affinity to "any"?
>>
> Nope, it indeed sets only soft affinity after automatic placement, hard
> affinity is left untouched.

Oh, right -- dur, I had it in my head that you were setting hard
affinities before this patch (and since I didn't see them removed I
assumed they were still there.) But that's what we did sometime in
ancient history, before the scheduler paid attention to the domain numa
affinity. :-)

That being the case,

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

(If I have time I may come by and do a closer review later.)

 -George
Dario Faggioli
2013-Nov-15 00:39 UTC
Re: [PATCH v2 10/16] xen: sched: use soft-affinity instead of domain''s node-affinity
On Thu, 2013-11-14 at 15:30 +0000, George Dunlap wrote:
> On 13/11/13 19:12, Dario Faggioli wrote:
> > [..]
> > The high level description of NUMA placement and scheduling in
> > docs/misc/xl-numa-placement.markdown is being updated too, to match
> > the new architecture.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
>
Cool, thanks.

> Just a few things to note below...
>
Ok.

> > diff --git a/xen/common/domain.c b/xen/common/domain.c
> > @@ -411,8 +411,6 @@ void domain_update_node_affinity(struct domain *d)
> >                  node_set(node, d->node_affinity);
> >      }
> >
> > -    sched_set_node_affinity(d, &d->node_affinity);
> > -
> >      spin_unlock(&d->node_affinity_lock);
>
> At this point, the only thing inside the spinlock is contingent on
> d->auto_node_affinity.
>
Mmm... Sorry, but I'm not getting what you mean here. :-(

> > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> > -static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
> > +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
> >                                             const cpumask_t *mask)
> >  {
> > -    const struct domain *d = vc->domain;
> > -    const struct csched_dom *sdom = CSCHED_DOM(d);
> > -
> > -    if ( d->auto_node_affinity
> > -         || cpumask_full(sdom->node_affinity_cpumask)
> > -         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> > +    if ( cpumask_full(vc->cpu_soft_affinity)
> > +         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
> >          return 0;
>
> At this point we've lost a way to make this check potentially much
> faster (being able to check auto_node_affinity).
>
Right.

> This isn't a super-hot
> path but it does happen fairly frequently --
>
Quite frequently indeed.

> will the "cpumask_full()"
> check take a significant amount of time on, say, a 4096-core system? If
> so, we might think about "caching" the results of cpumask_full() at some
> point.
>
Yes, I think cpumask_* operations could be heavy when the number of
pcpus is high. However, this is not really a problem introduced by this
series. Consider that the default behavior (for libxl and xl) is to go
through initial domain placement, which sets a node-affinity for the
domain explicitly, which means d->auto_node_affinity is false.

In fact, every domain that does not manually pin its vcpus at creation
time --which is what we want, because that way NUMA placement can do its
magic-- will have to go through the
(cpumask_full() || !cpumask_intersects()) check anyway. Basically, I'm
saying that having d->auto_node_affinity there may look like a speedup,
but it really is one only for a minority of cases.

So, yes, I think we should aim at optimizing this, but that is something
completely orthogonal to this series. That is to say: (a) we should do
it anyway, whether or not this series goes in; (b) for that same reason,
it shouldn't prevent this series from going in.

If you think this can be an issue for 4.4, I'm fine with creating a bug
for it and putting it among the blockers. At that point, I'll start
looking for a solution, and will commit to posting a fix ASAP but,
again, that's pretty independent from this very series, at least AFAICT.

Then, the fact that you provided your Reviewed-by above probably means
that you are aware of, and ok with, all this, but I felt it was worth
pointing out anyway. :-)

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-15 03:45 UTC
Re: [PATCH v2 12/16] libxl: get and set soft affinity
On Thu, 2013-11-14 at 15:11 +0000, Ian Jackson wrote:
> > +        if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap_soft, 0)) {
> > +            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap_soft");
> > +            return NULL;
> > +        }
>
> I went and looked at the error handling. Normally in libxl allocators
> can't fail but this one wants to do a hypercall to look up the max
> cpus.
>
Yep.

> But, pulling on the thread, I wonder if it would be better to do this
> error logging in libxl_cpu_bitmap_alloc rather than all the call
> sites.
>
Indeed. And the main problem I see here is that the logging mostly says
(and mine above is no exception, shame on me! :-P) "failed allocating
...", which is not true. If something failed, it was the retrieval of
max cpus/nodes!

> (If we did that in principle we should introduce a new #define
> which allows applications to #ifdef out their own logging for this,
> but in practice the double-logging isn't likely to be a problem.)
>
Do we? Right now we don't have anything like that... It looks to me that
logging inside rather than outside the function is going to result in
the same amount of logging we have already, with the notable difference
that the "new" logging will be much more accurate.

> And looking at libxl_cpu_bitmap_alloc, it calls libxl_get_max_cpus and
> does something very odd with the return value: libxl_get_max_nodes
> ought to return a positive number or a libxl error code.
>
> So I went to look at libxl_get_max_nodes and it just returns whatever
> it got from libxc, which is definitely wrong!
>
I think I see what you mean. It's a weird path, since those xc calls
(xc_get_max_cpus() and xc_get_max_nodes()) either return a positive
number (if successful) or zero (if failing). That's why, I think, libxl
ended up checking for zero, as an indication of failure, rather than
"< 0" / a libxl error code.

> Would you mind fixing this as part of this series ?
>
I don't, but I'm not sure I'm 100% getting how you think this should be
fixed.
What about the following (compile tested only) patch? Basically, I tried to modify libxl_get_max_{cpus,nodes} as you suggeste above (making them return either a positive number or a libxl error code). At the same time, I made them a bit more "robust" against what could come from xc_get_max_{cpus,nodes}() (by checking for <= 0). I also added the proper logging --stating correctly what is it that is failing-- inside libxl_get_max_{cpus,nodes}() and removed it from the callsites that had it (wrong). If you''re fine with that, I''ve got no problem in folding it in the series. Thanks and Regards, Dario --- diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0de1112..d3ab65e 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -616,10 +616,8 @@ static int cpupool_info(libxl__gc *gc, info->n_dom = xcinfo->n_dom; rc = libxl_cpu_bitmap_alloc(CTX, &info->cpumap, 0); if (rc) - { - LOG(ERROR, "unable to allocate cpumap %d\n", rc); goto out; - } + memcpy(info->cpumap.map, xcinfo->cpumap, info->cpumap.size); rc = 0; @@ -4204,10 +4202,8 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, } for (*nb_vcpu = 0; *nb_vcpu <= domaininfo.max_vcpu_id; ++*nb_vcpu, ++ptr) { - if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0)) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); + if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0)) return NULL; - } if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info"); return NULL; diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c index 682f874..28e5b91 100644 --- a/tools/libxl/libxl_utils.c +++ b/tools/libxl/libxl_utils.c @@ -645,6 +645,47 @@ char *libxl_bitmap_to_hex_string(libxl_ctx *ctx, const libxl_bitmap *bitmap) return q; } +inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, + libxl_bitmap *cpumap, + int max_cpus) +{ + if (max_cpus < 0) + return ERROR_INVAL; + + if (max_cpus == 0) + max_cpus = libxl_get_max_cpus(ctx); + 
+ if (max_cpus <= 0) { + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, + "failed to retrieve the maximum number of cpus"); + return ERROR_FAIL; + } + + + /* This can''t fail: no need to check and log */ + return libxl_bitmap_alloc(ctx, cpumap, max_cpus); +} + +int libxl_node_bitmap_alloc(libxl_ctx *ctx, + libxl_bitmap *nodemap, + int max_nodes) +{ + if (max_nodes < 0) + return ERROR_INVAL; + + if (max_nodes == 0) + max_nodes = libxl_get_max_nodes(ctx); + + if (max_nodes <= 0) { + LIBXL__LOG(ctx, LIBXL__LOG_ERROR, + "failed to retrieve the maximum number of nodes"); + return ERROR_FAIL; + } + + /* This can''t fail: no need to check and log */ + return libxl_bitmap_alloc(ctx, nodemap, max_nodes); +} + int libxl_nodemap_to_cpumap(libxl_ctx *ctx, const libxl_bitmap *nodemap, libxl_bitmap *cpumap) @@ -713,12 +754,16 @@ int libxl_cpumap_to_nodemap(libxl_ctx *ctx, int libxl_get_max_cpus(libxl_ctx *ctx) { - return xc_get_max_cpus(ctx->xch); + int max_cpus = xc_get_max_cpus(ctx->xch); + + return max_cpus <= 0 ? ERROR_FAIL : max_cpus; } int libxl_get_max_nodes(libxl_ctx *ctx) { - return xc_get_max_nodes(ctx->xch); + int max_nodes = xc_get_max_nodes(ctx->xch); + + return max_nodes <= 0 ? 
ERROR_FAIL : max_nodes; } int libxl__enum_from_string(const libxl_enum_string_table *t, diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index 7b84e6a..b11cf28 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -98,32 +98,12 @@ static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) #define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \ if (libxl_bitmap_test(&(m), v)) -static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap, - int max_cpus) -{ - if (max_cpus < 0) - return ERROR_INVAL; - if (max_cpus == 0) - max_cpus = libxl_get_max_cpus(ctx); - if (max_cpus == 0) - return ERROR_FAIL; - - return libxl_bitmap_alloc(ctx, cpumap, max_cpus); -} - -static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx, - libxl_bitmap *nodemap, - int max_nodes) -{ - if (max_nodes < 0) - return ERROR_INVAL; - if (max_nodes == 0) - max_nodes = libxl_get_max_nodes(ctx); - if (max_nodes == 0) - return ERROR_FAIL; - - return libxl_bitmap_alloc(ctx, nodemap, max_nodes); -} +int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, + libxl_bitmap *cpumap, + int max_cpus); +int libxl_node_bitmap_alloc(libxl_ctx *ctx, + libxl_bitmap *nodemap, + int max_nodes); /* Populate cpumap with the cpus spanned by the nodes in nodemap */ int libxl_nodemap_to_cpumap(libxl_ctx *ctx, -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
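The return-value convention discussed in this message can be looked at in isolation. What follows is only a minimal standalone sketch, not libxl code: fake_xc_get_max_cpus(), the sketch_* names and the numeric error values are assumptions made up for the illustration. It shows a low-level call that reports failure as 0 being wrapped so that callers see either a positive count or a negative error code, with the failure logged once, at the point where it is detected.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for libxl's error codes (values are assumptions). */
enum { SKETCH_ERROR_FAIL = -3, SKETCH_ERROR_INVAL = -6 };

/* Stand-in for a low-level call: positive count on success, 0 on failure. */
static int fake_xc_get_max_cpus(int simulate_failure)
{
    return simulate_failure ? 0 : 8;
}

/* Wrapper: returns a positive count or a negative error code, never 0. */
static int sketch_get_max_cpus(int simulate_failure)
{
    int max_cpus = fake_xc_get_max_cpus(simulate_failure);

    return max_cpus <= 0 ? SKETCH_ERROR_FAIL : max_cpus;
}

/*
 * Resolve the number of bits a cpu bitmap needs.
 * max_cpus < 0 is a caller error, 0 means "ask the system",
 * and a positive value is used as given.
 */
static int sketch_cpu_bitmap_size(int max_cpus, int simulate_failure)
{
    if (max_cpus < 0)
        return SKETCH_ERROR_INVAL;

    if (max_cpus == 0) {
        max_cpus = sketch_get_max_cpus(simulate_failure);
        assert(max_cpus != 0); /* the wrapper never returns 0 */
    }

    if (max_cpus < 0) {
        /* Log what actually failed: the lookup, not an allocation. */
        fprintf(stderr, "failed to retrieve the maximum number of cpus\n");
        return SKETCH_ERROR_FAIL;
    }

    return max_cpus;
}
```

With this shape, call sites only need to propagate the error; they no longer need their own (potentially misleading) log message.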
Dario Faggioli
2013-Nov-15 05:13 UTC
Re: [PATCH v2 12/16] libxl: get and set soft affinity
On gio, 2013-11-14 at 16:25 +0000, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH v2 12/16] libxl: get and set soft affinity"):
> > On 14/11/13 15:11, Ian Jackson wrote:
> > > So perhaps we should have a single function that can change the
> > > cpupool, hard affinity, and soft affinity, all at once ?
> >
> > I think this is probably a good idea. Would it make sense to basically
> > have libxl_[gs]et_vcpuaffinity2(), which takes a parameter that can
> > specify that the mask is for either hard, soft, or both?
>
> No, it needs to take at least two separate parameters for the hard and
> soft masks, because the toolstack might want to set the hard affinity
> and soft affinity to different values, without having to think
> carefully about ordering these calls to avoid generating spurious
> warnings and incompatible arrangements.
>
Ok, what about if I provide both? Something like this:

int libxl_set_vcpuaffinity2(libxl_ctx *ctx, uint32_t domid,
                            uint32_t vcpuid,
                            const libxl_bitmap *cpumap,
                            int flags)
{
    if (flags & HARD) <set_hard_affinity>(domid, vcpuid, cpumap)
    if (flags & SOFT) <set_soft_affinity>(domid, vcpuid, cpumap)
    <check_ecpumap_out_and_warn>
}

int libxl_set_vcpuaffinity3(libxl_ctx *ctx, uint32_t domid,
                            uint32_t vcpuid,
                            const libxl_bitmap *cpumap_hard,
                            const libxl_bitmap *cpumap_soft)
{
    <set_hard_affinity>(domid, vcpuid, cpumap_hard)
    <set_soft_affinity>(domid, vcpuid, cpumap_soft)
    <check_ecpumap_out_and_warn>
}

This ought to allow both use cases:
 1. when one only wants to change one (or both to the same value), and
    is ok with being WARN'ed if that results in an inconsistent state;
 2. when one wants to land directly in a specific state, which requires
    changing both, each one to its own value, and only wants to be
    WARN'ed _iff_ that final state is inconsistent.

I already have a half-baked updated series doing just this, but I'll be
working far from my testbox in the morning, so I'll complete the testing
and send it in the afternoon, ok?

> I think this probably applies to cpupool changes too. So perhaps the
> function should look a bit like setresuid or something: it should take
> an optional new cpupool (is there a sentinel value that can be used to
> mean "no change"?), an optional new soft mask, and an optional new
> hard mask.
>
I see what you mean, but cpupools are really a different thing, with
their own interface, etc. I'd rather implement what I described above
and stay consistent with the current cpupool behavior, handling soft
affinity the same way hard affinity is handled already (as George was
saying), at least for this series.

> > > What if the application makes a call to change the cpupool, without
> > > touching the affinities ? Should the affinities be reset
> > > automatically ?
> >
> > I think whatever happens for hard affinities right now should be carried on.
>
> Maybe it is a bug that it doesn't do anything.
>
Maybe, but that just means that we can (should) fix that even outside
of this series, right?

> I think it depends how
> we expect people to use this.
>
Indeed. :-D

> If a caller sets the hard affinities
> and then changes the cpupool, are they supposed to always then set the
> hard affinities again to a new suitable value ?
>
Yes and no. :-) I'll provide some examples of what happens, to try to
clarify the situation, as soon as I get the chance.

Thanks for the review and the nice discussion. :-)

Dario
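The difference between the two proposed call shapes is easier to see with a toy model. The sketch below is purely illustrative and does not use libxl: cpu masks are plain 64-bit words, and struct vcpu_aff, AFF_HARD/AFF_SOFT, both setter names and the "NULL means no change" convention are all assumptions invented for the example. Each setter returns 1 when it had to warn that the soft mask no longer intersects the hard mask.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define AFF_HARD 0x1 /* hypothetical flag: apply the mask as hard affinity */
#define AFF_SOFT 0x2 /* hypothetical flag: apply the mask as soft affinity */

/* Toy per-vcpu state: one bit per pcpu. */
struct vcpu_aff {
    uint64_t hard;
    uint64_t soft;
};

/* Warn (and return 1) if no pcpu is in both the hard and the soft mask. */
static int warn_if_inconsistent(const struct vcpu_aff *v)
{
    if ((v->hard & v->soft) == 0) {
        fprintf(stderr, "warning: soft affinity does not intersect hard affinity\n");
        return 1;
    }
    return 0;
}

/* Shape 1: a single mask plus flags saying which affinity it applies to. */
static int set_affinity_flags(struct vcpu_aff *v, uint64_t mask, int flags)
{
    assert(v != NULL);
    if (flags & AFF_HARD)
        v->hard = mask;
    if (flags & AFF_SOFT)
        v->soft = mask;
    return warn_if_inconsistent(v);
}

/* Shape 2: two separate masks; a NULL pointer leaves that affinity untouched. */
static int set_affinity_both(struct vcpu_aff *v,
                             const uint64_t *hard, const uint64_t *soft)
{
    assert(v != NULL);
    if (hard)
        v->hard = *hard;
    if (soft)
        v->soft = *soft;
    /* Only the final state is checked, so no transient warning is raised. */
    return warn_if_inconsistent(v);
}
```

Moving a vcpu from hard=0x0F/soft=0x03 to hard=0xF0/soft=0x30 with the flag-based setter takes two calls, and the first one warns about a state that only exists in between; the two-mask setter reaches the same final state in one call without warning.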
George Dunlap
2013-Nov-15 10:07 UTC
Re: [PATCH v2 06/16] xen: sched: make space for cpu_soft_affinity
On 14/11/13 16:14, Dario Faggioli wrote:
> On gio, 2013-11-14 at 15:03 +0000, George Dunlap wrote:
>> The breakdown of this in the series doesn't make much sense to me -- I
>> would have folded this one and patch 10 (use soft affinity instead of
>> node affinity) together, and put it in after patch 07
>> (s/affinity/hard_affinity/g;).
>>
> I see. Well, looks like I'm respinning it. While at it, I'll see if I
> can put it that way.
>
> If I do (and provided that does not imply any code change in those
> patches), can I retain your Reviewed-by tag in the result?

Yes, if you do a plain merge of this with patch 10 and don't make any
changes, go ahead and keep it.

 -George
George Dunlap
2013-Nov-15 10:52 UTC
Re: [PATCH v2 08/16] xen: derive NUMA node affinity from hard and soft CPU affinity
On 14/11/13 16:30, Dario Faggioli wrote:
> On gio, 2013-11-14 at 15:21 +0000, George Dunlap wrote:
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>>> @@ -373,8 +379,12 @@ void domain_update_node_affinity(struct domain *d)
>>>
>>>      for_each_vcpu ( d, v )
>>>      {
>>> +        /* Build up the mask of online pcpus we have hard affinity with */
>>>          cpumask_and(online_affinity, v->cpu_hard_affinity, online);
>>>          cpumask_or(cpumask, cpumask, online_affinity);
>>> +
>>> +        /* As well as the mask of all pcpus we have soft affinity with */
>>> +        cpumask_or(cpumask_soft, cpumask_soft, v->cpu_soft_affinity);
>> Is this really the most efficient way to do this? I would have thought
>> or'ing cpumask and cpumask_soft, and then if it's not empty, then use
>> it; maybe use a pointer so you don't have to copy one into the other one?
>>
> I'm not sure I fully get what you mean... I cannot afford neglecting
> online_affinity, independently from how cpumask and cpumask_soft look
> like, because that's what's needed to account for cpupools. Anyway, I'll
> think more about it and see if I can make it better.

So what you have here is (in pseudocode):

    online_affinity = hard_affinity & online;
    cpumask |= online_affinity;
    cpumask_soft |= soft_affinity;

    if ( intersects(cpumask, cpumask_soft) )
        cpumask &= cpumask_soft;

So at least four full bitwise operations, plus "intersects" which will
be the equivalent of a full bitwise operation if it turns out to be false.

How about something like the following:

    cpumask_hard = hard_affinity & online;
    cpumask_soft = cpumask_hard & soft_affinity;

    cpumask_p = is_empty(cpumask_soft) ? &cpumask_hard : &cpumask_soft;

So only two full bitwise operations, plus "is_empty", which I think
should be faster than "intersects" even if it turns out to be true,
since we can (I think) check full words at a time.

Then use cpumask_p in the subsequent node_affinity update. (Or just do
a full copy, since doing the nodewise check for intersection is going to
be fairly slow anyway.)

 -George
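The "two bitwise operations plus is_empty, then pick a pointer" idea can be tried out with a toy multi-word mask. This is only a sketch under stated assumptions: mask_t, MASK_WORDS and the helper names are invented for the illustration and are not Xen's cpumask API, and the caller supplies the two scratch masks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_WORDS 4 /* arbitrary: 4 x 64 = 256 cpus */

/* Toy cpu mask: one bit per pcpu, stored as an array of 64-bit words. */
typedef struct {
    uint64_t w[MASK_WORDS];
} mask_t;

/* dst = a & b: one full pass over the words. */
static void mask_and(mask_t *dst, const mask_t *a, const mask_t *b)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        dst->w[i] = a->w[i] & b->w[i];
}

/* Checks a whole word at a time and stops at the first non-zero word. */
static bool mask_is_empty(const mask_t *m)
{
    for (size_t i = 0; i < MASK_WORDS; i++)
        if (m->w[i] != 0)
            return false;
    return true;
}

/*
 * Two ANDs plus one emptiness check. The result is a pointer to one of the
 * scratch masks, so nothing is copied: soft (restricted to hard & online)
 * when that is non-empty, otherwise hard & online.
 */
static const mask_t *pick_effective(const mask_t *hard, const mask_t *soft,
                                    const mask_t *online,
                                    mask_t *scratch_hard, mask_t *scratch_soft)
{
    assert(scratch_hard != scratch_soft);
    mask_and(scratch_hard, hard, online);
    mask_and(scratch_soft, scratch_hard, soft);
    return mask_is_empty(scratch_soft) ? scratch_hard : scratch_soft;
}
```

The returned pointer is only valid as long as the scratch masks are, which is the price of avoiding the copy.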
George Dunlap
2013-Nov-15 11:23 UTC
Re: [PATCH v2 10/16] xen: sched: use soft-affinity instead of domain''s node-affinity
On 15/11/13 00:39, Dario Faggioli wrote:
> On gio, 2013-11-14 at 15:30 +0000, George Dunlap wrote:
>> On 13/11/13 19:12, Dario Faggioli wrote:
>>> [..]
>>> The high level description of NUMA placement and scheduling in
>>> docs/misc/xl-numa-placement.markdown is being updated too, to match
>>> the new architecture.
>>>
>>> signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
> Cool, thanks.
>
>> Just a few things to note below...
>>
> Ok.
>
>>> diff --git a/xen/common/domain.c b/xen/common/domain.c
>>> @@ -411,8 +411,6 @@ void domain_update_node_affinity(struct domain *d)
>>>                  node_set(node, d->node_affinity);
>>>      }
>>>
>>> -    sched_set_node_affinity(d, &d->node_affinity);
>>> -
>>>      spin_unlock(&d->node_affinity_lock);
>> At this point, the only thing inside the spinlock is contingent on
>> d->auto_node_affinity.
>>
> Mmm... Sorry, but I'm not getting what you mean here. :-(

I mean just what I said -- if d->auto_node_affinity is false, nothing
inside the critical region here needs to be done. I'm just pointing it
out. :-) (This is sort of related to my comment on the other patch,
about not needing to do the work of calculating intersections.)

>
>>> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
>>> -static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>>> +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc,
>>>                                             const cpumask_t *mask)
>>>  {
>>> -    const struct domain *d = vc->domain;
>>> -    const struct csched_dom *sdom = CSCHED_DOM(d);
>>> -
>>> -    if ( d->auto_node_affinity
>>> -         || cpumask_full(sdom->node_affinity_cpumask)
>>> -         || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
>>> +    if ( cpumask_full(vc->cpu_soft_affinity)
>>> +         || !cpumask_intersects(vc->cpu_soft_affinity, mask) )
>>>          return 0;
>> At this point we've lost a way to make this check potentially much
>> faster (being able to check auto_node_affinity).
>>
> Right.
>
>> This isn't a super-hot
>> path but it does happen fairly frequently --
>>
> Quite frequently indeed.
>
>> will the "cpumask_full()"
>> check take a significant amount of time on, say, a 4096-core system? If
>> so, we might think about "caching" the results of cpumask_full() at some
>> point.
>>
> Yes, I think cpumask_* operation could be heavy when the number of pcpus
> is high. However, this is not really a problem introduced by this
> series. Consider that the default behavior (for libxl and xl) is to go
> through initial domain placement, which would set a node-affinity for
> the domain explicitly, which means d->auto_node_affinity is false.
>
> In fact, every domain that does not manually pin its vcpus at creation
> time --which is what we want, because that way NUMA placement can do its
> magic-- will have to go through the (cpumask_full || !cpumask_intrscts)
> anyway. Basically, I'm saying that having d->auto_node_affinity there
> may look like a speedup, but it really is only for a minority of cases.
>
> So, yes, I think we should aim at optimizing this, but that is something
> completely orthogonal to this series. That is to say: (a) we should do
> it anyway, whether or not this series goes in; (b) for that same reason,
> that shouldn't prevent this series from going in.
>
> If you think this can be an issue for 4.4, I'm fine creating a bug for
> it and putting it among the blockers. At that point, I'll start looking
> for a solution, and will commit to post a fix ASAP, but again, that's
> pretty independent from this very series, at least AFAICT.
>
> Then, the fact that you provided your Reviewed-by above probably means
> that you are aware and ok with this all, but I felt like it was worth
> pointing it out anyway. :-)

Yes, the "at some point" was intended to imply that I didn't think this
had to be done right away, as was "things to note", which means, "I just
want to point this out, they're not something which needs to be acted on
right away."

 -George
George Dunlap
2013-Nov-15 12:02 UTC
Re: [PATCH v2 12/16] libxl: get and set soft affinity
On 14/11/13 16:25, Ian Jackson wrote:
> George Dunlap writes ("Re: [PATCH v2 12/16] libxl: get and set soft affinity"):
>> On 14/11/13 15:11, Ian Jackson wrote:
>>> So perhaps we should have a single function that can change the
>>> cpupool, hard affinity, and soft affinity, all at once ?
>> I think this is probably a good idea. Would it make sense to basically
>> have libxl_[gs]et_vcpuaffinity2(), which takes a parameter that can
>> specify that the mask is for either hard, soft, or both?
> No, it needs to take at least two separate parameters for the hard and
> soft masks, because the toolstack might want to set the hard affinity
> and soft affinity to different values, without having to think
> carefully about ordering these calls to avoid generating spurious
> warnings and incompatible arrangements.
>
> I think this probably applies to cpupool changes too. So perhaps the
> function should look a bit like setresuid or something: it should take
> an optional new cpupool (is there a sentinel value that can be used to
> mean "no change"?), an optional new soft mask, and an optional new
> hard mask.
>
>>> What if the application makes a call to change the cpupool, without
>>> touching the affinities ? Should the affinities be reset
>>> automatically ?
>> I think whatever happens for hard affinities right now should be carried on.
> Maybe it is a bug that it doesn't do anything. I think it depends how
> we expect people to use this. If a caller sets the hard affinities
> and then changes the cpupool, are they supposed to always then set the
> hard affinities again to a new suitable value ?

Well in fact, as far as I can tell, it *does* do something. When moving
a vcpu to a new pool, it unconditionally calls
cpumask_setall(v->cpu_affinity) for each vcpu, which will effectively
erase the hard affinity. (xen/common/schedule.c:sched_move_domain()).

And, when unplugging cpus, if it unplugs the last cpu a vcpu can run on,
it also resets the affinity to "all".

 -George
Dario Faggioli
2013-Nov-15 14:17 UTC
Re: [PATCH v2 08/16] xen: derive NUMA node affinity from hard and soft CPU affinity
On ven, 2013-11-15 at 10:52 +0000, George Dunlap wrote:
> On 14/11/13 16:30, Dario Faggioli wrote:
>
> > I'm not sure I fully get what you mean... I cannot afford neglecting
> > online_affinity, independently from how cpumask and cpumask_soft look
> > like, because that's what's needed to account for cpupools. Anyway, I'll
> > think more about it and see if I can make it better.
>
> So what you have here is (in pseudocode):
>
>     online_affinity = hard_affinity & online;
>     cpumask |= online_affinity;
>     cpumask_soft |= soft_affinity;
>
>     if ( intersects(cpumask, cpumask_soft) )
>         cpumask &= cpumask_soft;
>
> So at least four full bitwise operations, plus "intersects" which will
> be the equivalent of a full bitwise operation if it turns out to be false.
>
Well, that's not exactly what I have. Fact is, _every_ vcpu has a
soft_affinity and a hard_affinity. That's why I have the cpumask_or()-s
in the loop, to build a hard affinity and a soft affinity mask of the
_domain_, by OR-ing together the hard (soft) affinities of all the
vcpus.

From the pseudocode above (and even from the one below), it's not clear
to me whether you mean vcpu hard (and soft) affinity or domain ones. In
fact...

> How about something like the following:
>
>     cpumask_hard = hard_affinity & online;
>     cpumask_soft = cpumask_hard & soft_affinity;
>
>     cpumask_p = is_empty(cpumask_soft) ? &cpumask_hard : &cpumask_soft;
>
... From reading this last pseudo-statement I think you actually mean
the affinity masks for the whole domain, which is exactly what I also
want to consider... But I have to construct them somehow... That is,
again, where the cpumask_or()-s come from.

So, taking this per-vcpu-ness into account, what I have is:

    for_each_vcpu(i)
        dom_cpumask_hard |= online & cpumask_hard(i)
        dom_cpumask_soft |= cpumask_soft(i)

    if ( dom_cpumask_hard & dom_cpumask_soft )
        dom_cpumask = dom_cpumask_hard & dom_cpumask_soft
    else
        <use dom_cpumask_hard>

Perhaps I can rename the variables with something like this 'dom_'
prefix in the real code too, to make things more clear.

However, it is indeed true that online is per domain, so yes, perhaps I
can do something like the following:

    for_each_vcpu(i)
        dom_cpumask_hard |= cpumask_hard(i)
        dom_cpumask_soft |= cpumask_soft(i)

    dom_cpumask_hard &= online
    dom_cpumask_soft &= dom_cpumask_hard

    dom_cpumask_p = is_empty(dom_cpumask_soft) ?
                        dom_cpumask_hard : dom_cpumask_soft;

I think it's both more efficient and more clear than what I have now,
and I give you that it's close enough to what you were suggesting... But
I don't think I can ditch the '|=' in the loop for building up the two
domain wide cpumasks.

How do you like it?

Dario
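The second variant proposed in this message (accumulate per-vcpu masks first, apply the domain-wide online mask once afterwards) can be written down as a small standalone function. This is a sketch under stated assumptions, not the Xen implementation: masks are single 64-bit words, and struct vcpu_masks and domain_effective_mask() are names invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy per-vcpu affinities: one bit per pcpu. */
struct vcpu_masks {
    uint64_t hard;
    uint64_t soft;
};

/*
 * Build the domain-wide masks by OR-ing the per-vcpu ones, then restrict
 * them once, outside the loop: hard to the online pcpus, soft to what is
 * left of hard. The soft mask wins unless it ended up empty.
 */
static uint64_t domain_effective_mask(const struct vcpu_masks *vcpus,
                                      size_t nr_vcpus, uint64_t online)
{
    uint64_t dom_hard = 0, dom_soft = 0;

    assert(vcpus != NULL || nr_vcpus == 0);

    for (size_t i = 0; i < nr_vcpus; i++) {
        dom_hard |= vcpus[i].hard;
        dom_soft |= vcpus[i].soft;
    }

    dom_hard &= online;   /* 'online' is per domain, so apply it only once */
    dom_soft &= dom_hard; /* soft affinity only counts within hard affinity */

    return dom_soft != 0 ? dom_soft : dom_hard;
}
```

Compared with the first variant, the AND with the online mask moves out of the per-vcpu loop, while the two '|=' accumulations stay in it, as the message argues they must.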
Dario Faggioli
2013-Nov-15 17:29 UTC
Re: [PATCH v2 12/16] libxl: get and set soft affinity
On ven, 2013-11-15 at 12:02 +0000, George Dunlap wrote:
> On 14/11/13 16:25, Ian Jackson wrote:
> > Maybe it is a bug that it doesn't do anything. I think it depends how
> > we expect people to use this. If a caller sets the hard affinities
> > and then changes the cpupool, are they supposed to always then set the
> > hard affinities again to a new suitable value ?
>
> Well in fact, as far as I can tell, it *does* do something. When moving
> a vcpu to a new pool, it unconditionally calls
> cpumask_setall(v->cpu_affinity) for each vcpu, which will effectively
> erase the hard affinity. (xen/common/schedule.c:sched_move_domain()).
>
> And, when unplugging cpus, if it unplugs the last cpu a vcpu can run on,
> it also resets the affinity to "all".
>
Right. But this all happens at the hypervisor level and, personally, I
think it's just fine. The point here is how we should behave and what
kind of interface we should have/add at the libxl level.

My opinion here is that, while it is ok to have calls that deal with
both hard and soft affinity together, we should leave cpupool alone, as
it is too different both as a concept and as an interface.

Thanks and Regards,
Dario