Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 00/17] Implement vcpu soft affinity for credit1
Take 5. The main changes from the last round are:
 - changed the Xen and libxc interface, as agreed with Jan during the
   v4 review. It is now possible to specify, and get back, both the
   (effective) hard and soft affinity with just one hypercall;
 - applied all of IanC's comments on the tool patches.

Release-wise, George expressed himself in favour of a freeze exception
here:
http://bugs.xenproject.org/xen/mid/<528B8B25.6000608@eu.citrix.com>

The patch series is available in the following git branch:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v5

Thanks and Regards,
Dario

---

Dario Faggioli (17):
  a   xl: match output of vcpu-list with pinning syntax
  n   libxl: better name for last parameter of libxl_list_vcpu
  n   libxl: fix memory leak in libxl_list_vcpu
  a   libxc/libxl: sanitize error handling in *_get_max_{cpus,nodes}
      libxc/libxl: allow to retrieve the number of online pCPUs
  ra  xl: allow for node-wise specification of vcpu pinning
  a   xl: implement and enable dryrun mode for `xl vcpu-pin'
  a   xl: test script for the cpumap parser (for vCPU pinning)
  r   xen: sched: rename v->cpu_affinity into v->cpu_hard_affinity
  r   xen: sched: introduce soft-affinity and use it instead of d->node-affinity
  r   xen: derive NUMA node affinity from hard and soft CPU affinity
  r   xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
  a   libxc: get and set soft and hard affinity
  r   libxl: get and set soft affinity
      xl: enable getting and setting soft affinities
  *   xl: enable for specifying node-affinity in the config file
  a   libxl: automatic NUMA placement affects soft affinity

 n = new in v4
 r = has been 'Reviewed-by'
 a = has been 'Acked-by'
 * = has been 'Acked-by', but the implementation changed a bit
     (IanJ explicitly said to point this out)

 docs/man/xl.cfg.pod.5                           |   62 ++-
 docs/man/xl.pod.1                               |   20 +
 docs/misc/xl-numa-placement.markdown            |  162 +++++--
 tools/libxc/xc_domain.c                         |   76 ++--
 tools/libxc/xc_misc.c                           |   32 +
 tools/libxc/xenctrl.h                           |   58 +++
 tools/libxl/Makefile                            |    2 
 tools/libxl/check-xl-vcpupin-parse              |  294 ++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 ++
 tools/libxl/libxl.c                             |  124 ++++--
 tools/libxl/libxl.h                             |   44 ++
 tools/libxl/libxl_create.c                      |    6 
 tools/libxl/libxl_dom.c                         |   23 +
 tools/libxl/libxl_types.idl                     |    4 
 tools/libxl/libxl_utils.c                       |   67 +++
 tools/libxl/libxl_utils.h                       |   44 +-
 tools/libxl/xl_cmdimpl.c                        |  497 +++++++++++++++--------
 tools/libxl/xl_cmdtable.c                       |    5 
 tools/ocaml/libs/xc/xenctrl_stubs.c             |    8 
 tools/python/xen/lowlevel/xc/xc.c               |   12 -
 xen/arch/x86/traps.c                            |   13 -
 xen/common/domain.c                             |   86 ++--
 xen/common/domctl.c                             |  111 +++++
 xen/common/keyhandler.c                         |    4 
 xen/common/sched_credit.c                       |  161 +++----
 xen/common/sched_sedf.c                         |    2 
 xen/common/schedule.c                           |   57 ++-
 xen/common/wait.c                               |   10 
 xen/include/public/domctl.h                     |   17 +
 xen/include/xen/sched.h                         |   14 -
 30 files changed, 1541 insertions(+), 527 deletions(-)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 01/17] xl: match output of vcpu-list with pinning syntax
In fact, pinning to all the pcpus happens by specifying "all" (either
on the command line or in the config file), while `xl vcpu-list'
reports it as "any cpu". Change this into something more consistent,
by using "all" everywhere.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v1:
 * this patch was not there in v1. It is there now because using the
   same syntax for both input and output was requested during review.
---
 tools/libxl/xl_cmdimpl.c |   27 +++++++--------------------
 1 file changed, 7 insertions(+), 20 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index c5677cd..7fc049f 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -3117,8 +3117,7 @@ out:
     }
 }
 
-/* If map is not full, prints it and returns 0. Returns 1 otherwise. */
-static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
+static void print_bitmap(uint8_t *map, int maplen, FILE *stream)
 {
     int i;
     uint8_t pmap = 0, bitmask = 0;
@@ -3156,28 +3155,16 @@ static int print_bitmap(uint8_t *map, int maplen, FILE *stream)
         case 2:
             break;
         case 1:
-            if (firstset == 0)
-                return 1;
+            if (firstset == 0) {
+                fprintf(stream, "all");
+                break;
+            }
         case 3:
             fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
             if (i - 1 > firstset)
                 fprintf(stream, "-%d", i - 1);
             break;
        }
-
-    return 0;
-}
-
-static void print_cpumap(uint8_t *map, int maplen, FILE *stream)
-{
-    if (print_bitmap(map, maplen, stream))
-        fprintf(stream, "any cpu");
-}
-
-static void print_nodemap(uint8_t *map, int maplen, FILE *stream)
-{
-    if (print_bitmap(map, maplen, stream))
-        fprintf(stream, "any node");
 }
 
 static void list_domains(int verbose, int context, int claim, int numa,
@@ -3250,7 +3237,7 @@ static void list_domains(int verbose, int context, int claim, int numa,
             libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap);
 
             putchar(' ');
-            print_nodemap(nodemap.map, physinfo.nr_nodes, stdout);
+            print_bitmap(nodemap.map, physinfo.nr_nodes, stdout);
         }
         putchar('\n');
     }
@@ -4462,7 +4449,7 @@ static void print_vcpuinfo(uint32_t tdomid,
     /* TIM */
     printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9));
     /* CPU AFFINITY */
-    print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout);
+    print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout);
     printf("\n");
 }
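[For illustration, here is a minimal, self-contained sketch (not the xl
code itself) of the same range-folding idea print_bitmap() implements:
scan for runs of set bits, print them as "a-b" ranges, and collapse a
fully-set map into "all" (and an empty one into "none"):]

#include <stdio.h>
#include <stdbool.h>

/* Illustrative only: print a bitmap of `len' bits as "all", "none",
 * or a comma-separated list of ranges. */
static void print_ranges(const bool *bits, int len, FILE *stream)
{
    int i = 0, first;
    bool printed = false;

    while (i < len) {
        while (i < len && !bits[i]) i++;     /* skip clear bits */
        if (i == len) break;
        first = i;
        while (i < len && bits[i]) i++;      /* scan the set run */
        if (first == 0 && i == len) {        /* every bit is set */
            fprintf(stream, "all");
            return;
        }
        fprintf(stream, "%s%d", printed ? "," : "", first);
        if (i - 1 > first)
            fprintf(stream, "-%d", i - 1);
        printed = true;
    }
    if (!printed)
        fprintf(stream, "none");
}

int main(void)
{
    bool map[8] = { 1, 0, 1, 1, 1, 0, 1, 1 };

    print_ranges(map, 8, stdout);            /* prints "0,2-4,6-7" */
    putchar('\n');
    return 0;
}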
Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 02/17] libxl: better name for last parameter of libxl_list_vcpu
The last parameter is currently called `nr_vcpus_out', but what is
returned there is the number of pCPUs, rather than the number of vCPUs
(which is also returned, but in another parameter, `nb_vcpu'). Rename
it to `nr_cpus_out'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v4:
 * new patch.
---
 tools/libxl/libxl.c |    4 ++--
 tools/libxl/libxl.h |    2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index a57d571..0cb7ca9 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4525,7 +4525,7 @@ const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx)
 }
 
 libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
-                                int *nb_vcpu, int *nr_vcpus_out)
+                                int *nb_vcpu, int *nr_cpus_out)
 {
     libxl_vcpuinfo *ptr, *ret;
     xc_domaininfo_t domaininfo;
@@ -4535,7 +4535,7 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
         LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting infolist");
         return NULL;
     }
-    *nr_vcpus_out = libxl_get_max_cpus(ctx);
+    *nr_cpus_out = libxl_get_max_cpus(ctx);
     ret = ptr = calloc(domaininfo.max_vcpu_id + 1, sizeof (libxl_vcpuinfo));
     if (!ptr) {
         return NULL;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index a9663e4..bf311fe 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -744,7 +744,7 @@ libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
 void libxl_numainfo_list_free(libxl_numainfo *, int nr);
 
 libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
-                                int *nb_vcpu, int *nr_vcpus_out);
+                                int *nb_vcpu, int *nr_cpus_out);
 void libxl_vcpuinfo_list_free(libxl_vcpuinfo *, int nr_vcpus);
 
 void libxl_device_vtpm_list_free(libxl_device_vtpm*, int nr_vtpms);
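[A hedged usage sketch of the renamed interface (assuming a build
environment where libxl.h is available, and with error handling kept
minimal), showing that the last out-parameter counts host pCPUs, not
vCPUs:]

#include <stdio.h>
#include <stdint.h>
#include <libxl.h>

/* Minimal sketch: list the vCPUs of domain `domid'. Note that the
 * second out-parameter is the number of host pCPUs, not of vCPUs. */
static void show_vcpus(libxl_ctx *ctx, uint32_t domid)
{
    int nb_vcpu, nr_cpus, i;
    libxl_vcpuinfo *info = libxl_list_vcpu(ctx, domid, &nb_vcpu, &nr_cpus);

    if (!info)
        return;
    printf("%d vCPUs, host has %d pCPUs\n", nb_vcpu, nr_cpus);
    for (i = 0; i < nb_vcpu; i++)
        printf("vcpu %u on pcpu %u\n", info[i].vcpuid, info[i].cpu);
    libxl_vcpuinfo_list_free(info, nb_vcpu);
}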
Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 03/17] libxl: fix memory leak in libxl_list_vcpu
More specifically, the leak is of the cpumap inside libxl_vcpuinfo, in
case of failure after it has been allocated. While at it, use the
proper libxl memory allocation wrapper for calloc() in there, and
convert the function to the new LOGE() logging style.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v4:
 * new patch: it wasn't there in v4, but the leak was identified
   during review.
---
 tools/libxl/libxl.c |   19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 0cb7ca9..028106b 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -4527,31 +4527,31 @@ const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx)
 libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
                                 int *nb_vcpu, int *nr_cpus_out)
 {
+    GC_INIT(ctx);
     libxl_vcpuinfo *ptr, *ret;
     xc_domaininfo_t domaininfo;
     xc_vcpuinfo_t vcpuinfo;
 
     if (xc_domain_getinfolist(ctx->xch, domid, 1, &domaininfo) != 1) {
-        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting infolist");
+        LOGE(ERROR, "getting infolist");
         return NULL;
     }
     *nr_cpus_out = libxl_get_max_cpus(ctx);
-    ret = ptr = calloc(domaininfo.max_vcpu_id + 1, sizeof (libxl_vcpuinfo));
-    if (!ptr) {
-        return NULL;
-    }
+    ret = ptr = libxl__calloc(NOGC, domaininfo.max_vcpu_id + 1,
+                              sizeof(libxl_vcpuinfo));
 
     for (*nb_vcpu = 0; *nb_vcpu <= domaininfo.max_vcpu_id; ++*nb_vcpu, ++ptr) {
+        libxl_bitmap_init(&ptr->cpumap);
         if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0)) {
-            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap");
+            LOGE(ERROR, "allocating cpumap");
             goto err;
         }
         if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) {
-            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info");
+            LOGE(ERROR, "getting vcpu info");
             goto err;
         }
         if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, ptr->cpumap.map) == -1) {
-            LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu affinity");
+            LOGE(ERROR, "getting vcpu affinity");
             goto err;
         }
         ptr->vcpuid = *nb_vcpu;
@@ -4561,10 +4561,13 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
         ptr->running = !!vcpuinfo.running;
         ptr->vcpu_time = vcpuinfo.cpu_time;
     }
+    GC_FREE;
     return ret;
 
 err:
+    libxl_bitmap_dispose(&ptr->cpumap);
     free(ret);
+    GC_FREE;
     return NULL;
 }
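[The underlying idiom, shown as a toy, self-contained sketch (with
stand-in types, not the libxl ones): initialise each embedded bitmap
before anything can fail, so the error path can dispose of it
unconditionally:]

#include <stdlib.h>

/* Toy stand-ins for libxl_bitmap_{init,alloc,dispose}; only the
 * error-handling idiom from the fix is of interest here. */
struct bitmap { unsigned char *map; };

static void bitmap_init(struct bitmap *b)    { b->map = NULL; }
static int  bitmap_alloc(struct bitmap *b)   { return (b->map = calloc(8, 1)) ? 0 : -1; }
static void bitmap_dispose(struct bitmap *b) { free(b->map); b->map = NULL; }

static struct bitmap *alloc_all(int n)
{
    struct bitmap *ret = calloc(n, sizeof(*ret)), *ptr = ret;
    int i;

    if (!ret)
        return NULL;
    for (i = 0; i < n; i++, ptr++) {
        bitmap_init(ptr);          /* before anything can fail ...  */
        if (bitmap_alloc(ptr))     /* ... so dispose is always safe */
            goto err;
    }
    return ret;

 err:
    bitmap_dispose(ptr);           /* the element that failed       */
    while (ptr != ret)
        bitmap_dispose(--ptr);     /* the ones allocated before it  */
    free(ret);
    return NULL;
}

int main(void)
{
    struct bitmap *maps = alloc_all(4);
    int i;

    if (maps) {
        for (i = 0; i < 4; i++)
            bitmap_dispose(&maps[i]);
        free(maps);
    }
    return 0;
}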
Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus,nodes}
In libxc, make xc_get_max_{cpus,nodes}() always return either a
positive number or -1, and change all the callers to deal with that.

In libxl, make libxl_get_max_{cpus,nodes}() always return either a
positive number or a libxl error code. Thanks to that, it is also
possible to fix logging for libxl_{cpu,node}_bitmap_alloc(), which now
happens inside the functions themselves, more accurately reporting
what happened.

Note that libxl_{cpu,node}_bitmap_alloc() are also moved to
libxl_utils.c: apart from having become less trivial (and hence better
suited there), this is required for using GC_INIT()/GC_FREE, which in
turn is needed by the LOG()/LOGE() macros.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v4:
 * fix error handling in xc_get_max_{cpus,nodes}() as well, as
   suggested during review;
 * no longer move libxl_{cpu,node}_bitmap_alloc() from .h to .c, as
   requested during review;
 * add more logging in libxl_{cpu,node}_bitmap_alloc(), as suggested
   during review;
 * propagate the error from libxl_get_max_{cpus,nodes}(), as suggested
   during review.

Changes from v3:
 * switch the functions to LOG()/LOGE();
 * take care of the callers of libxl_get_max_{cpus,nodes}() too.

Changes from v2:
 * this wasn't there in v2, but fixing it for v3 was requested during
   the v2 review.
---
 tools/libxc/xc_misc.c             |   22 +++++++++++---
 tools/libxl/libxl.c               |   16 ++++------
 tools/libxl/libxl_utils.c         |   60 ++++++++++++++++++++++++++++++++++++-
 tools/libxl/libxl_utils.h         |   29 ++----------------
 tools/python/xen/lowlevel/xc/xc.c |    6 ++--
 5 files changed, 88 insertions(+), 45 deletions(-)

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index 56efe6a..c771469 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -30,9 +30,12 @@ int xc_get_max_cpus(xc_interface *xch)
         return max_cpus;
 
     if ( !xc_physinfo(xch, &physinfo) )
+    {
         max_cpus = physinfo.max_cpu_id + 1;
+        return max_cpus;
+    }
 
-    return max_cpus;
+    return -1;
 }
 
 int xc_get_max_nodes(xc_interface *xch)
@@ -44,19 +47,30 @@ int xc_get_max_nodes(xc_interface *xch)
         return max_nodes;
 
     if ( !xc_physinfo(xch, &physinfo) )
+    {
         max_nodes = physinfo.max_node_id + 1;
+        return max_nodes;
+    }
 
-    return max_nodes;
+    return -1;
 }
 
 int xc_get_cpumap_size(xc_interface *xch)
 {
-    return (xc_get_max_cpus(xch) + 7) / 8;
+    int max_cpus = xc_get_max_cpus(xch);
+
+    if ( max_cpus < 0 )
+        return -1;
+    return (max_cpus + 7) / 8;
 }
 
 int xc_get_nodemap_size(xc_interface *xch)
 {
-    return (xc_get_max_nodes(xch) + 7) / 8;
+    int max_nodes = xc_get_max_nodes(xch);
+
+    if ( max_nodes < 0 )
+        return -1;
+    return (max_nodes + 7) / 8;
 }
 
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
index 028106b..d3180dc 100644
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -615,10 +615,8 @@ static int cpupool_info(libxl__gc *gc,
     info->n_dom = xcinfo->n_dom;
     rc = libxl_cpu_bitmap_alloc(CTX, &info->cpumap, 0);
     if (rc)
-    {
-        LOG(ERROR, "unable to allocate cpumap %d\n", rc);
         goto out;
-    }
+
     memcpy(info->cpumap.map, xcinfo->cpumap, info->cpumap.size);
 
     rc = 0;
@@ -4355,7 +4353,7 @@ libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nb_cpu_out)
     int max_cpus;
 
     max_cpus = libxl_get_max_cpus(ctx);
-    if (max_cpus == 0)
+    if (max_cpus < 0)
     {
         LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of CPUS");
         ret = NULL;
@@ -4420,7 +4418,7 @@ libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
     int i, j, max_nodes;
 
     max_nodes = libxl_get_max_nodes(ctx);
-    if (max_nodes == 0)
+    if (max_nodes < 0)
     {
         LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of NODES");
         ret = NULL;
@@ -4542,10 +4540,8 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
 
     for (*nb_vcpu = 0; *nb_vcpu <= domaininfo.max_vcpu_id; ++*nb_vcpu, ++ptr) {
         libxl_bitmap_init(&ptr->cpumap);
-        if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0)) {
-            LOGE(ERROR, "allocating cpumap");
+        if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0))
             goto err;
-        }
         if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) {
             LOGE(ERROR, "getting vcpu info");
             goto err;
@@ -5315,8 +5311,8 @@ int libxl_get_freecpus(libxl_ctx *ctx, libxl_bitmap *cpumap)
     int ncpus;
 
     ncpus = libxl_get_max_cpus(ctx);
-    if (ncpus == 0)
-        return ERROR_FAIL;
+    if (ncpus < 0)
+        return ncpus;
 
     cpumap->map = xc_cpupool_freeinfo(ctx->xch);
     if (cpumap->map == NULL)
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
index 9f5f589..0833de2 100644
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -651,6 +651,58 @@ char *libxl_bitmap_to_hex_string(libxl_ctx *ctx, const libxl_bitmap *bitmap)
     return q;
 }
 
+int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap, int max_cpus)
+{
+    GC_INIT(ctx);
+    int rc = 0;
+
+    if (max_cpus < 0) {
+        rc = ERROR_INVAL;
+        LOG(ERROR, "invalid number of cpus provided");
+        goto out;
+    }
+    if (max_cpus == 0)
+        max_cpus = libxl_get_max_cpus(ctx);
+    if (max_cpus < 0) {
+        LOG(ERROR, "failed to retrieve the maximum number of cpus");
+        rc = max_cpus;
+        goto out;
+    }
+    /* This can't fail: no need to check and log */
+    libxl_bitmap_alloc(ctx, cpumap, max_cpus);
+
+ out:
+    GC_FREE;
+    return rc;
+}
+
+int libxl_node_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *nodemap,
+                            int max_nodes)
+{
+    GC_INIT(ctx);
+    int rc = 0;
+
+    if (max_nodes < 0) {
+        rc = ERROR_INVAL;
+        LOG(ERROR, "invalid number of nodes provided");
+        goto out;
+    }
+
+    if (max_nodes == 0)
+        max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes < 0) {
+        LOG(ERROR, "failed to retrieve the maximum number of nodes");
+        rc = max_nodes;
+        goto out;
+    }
+    /* This can't fail: no need to check and log */
+    libxl_bitmap_alloc(ctx, nodemap, max_nodes);
+
+ out:
+    GC_FREE;
+    return rc;
+}
+
 int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
                             const libxl_bitmap *nodemap,
                             libxl_bitmap *cpumap)
@@ -719,12 +771,16 @@ int libxl_cpumap_to_nodemap(libxl_ctx *ctx,
 
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
-    return xc_get_max_cpus(ctx->xch);
+    int max_cpus = xc_get_max_cpus(ctx->xch);
+
+    return max_cpus < 0 ? ERROR_FAIL : max_cpus;
 }
 
 int libxl_get_max_nodes(libxl_ctx *ctx)
 {
-    return xc_get_max_nodes(ctx->xch);
+    int max_nodes = xc_get_max_nodes(ctx->xch);
+
+    return max_nodes < 0 ? ERROR_FAIL : max_nodes;
 }
 
 int libxl__enum_from_string(const libxl_enum_string_table *t,
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
index 7b84e6a..e37fb89 100644
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -98,32 +98,9 @@ static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit)
 #define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \
                                          if (libxl_bitmap_test(&(m), v))
 
-static inline int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap,
-                                         int max_cpus)
-{
-    if (max_cpus < 0)
-        return ERROR_INVAL;
-    if (max_cpus == 0)
-        max_cpus = libxl_get_max_cpus(ctx);
-    if (max_cpus == 0)
-        return ERROR_FAIL;
-
-    return libxl_bitmap_alloc(ctx, cpumap, max_cpus);
-}
-
-static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx,
-                                          libxl_bitmap *nodemap,
-                                          int max_nodes)
-{
-    if (max_nodes < 0)
-        return ERROR_INVAL;
-    if (max_nodes == 0)
-        max_nodes = libxl_get_max_nodes(ctx);
-    if (max_nodes == 0)
-        return ERROR_FAIL;
-
-    return libxl_bitmap_alloc(ctx, nodemap, max_nodes);
-}
+int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap, int max_cpus);
+int libxl_node_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *nodemap,
+                            int max_nodes);
 
 /* Populate cpumap with the cpus spanned by the nodes in nodemap */
 int libxl_nodemap_to_cpumap(libxl_ctx *ctx,
diff --git a/tools/python/xen/lowlevel/xc/xc.c b/tools/python/xen/lowlevel/xc/xc.c
index 2625fc4..737bdac 100644
--- a/tools/python/xen/lowlevel/xc/xc.c
+++ b/tools/python/xen/lowlevel/xc/xc.c
@@ -233,7 +233,7 @@ static PyObject *pyxc_vcpu_setaffinity(XcObject *self,
         return NULL;
 
     nr_cpus = xc_get_max_cpus(self->xc_handle);
-    if ( nr_cpus == 0 )
+    if ( nr_cpus < 0 )
         return pyxc_error_to_exception(self->xc_handle);
 
     cpumap = xc_cpumap_alloc(self->xc_handle);
@@ -392,7 +392,7 @@ static PyObject *pyxc_vcpu_getinfo(XcObject *self,
         return NULL;
 
     nr_cpus = xc_get_max_cpus(self->xc_handle);
-    if ( nr_cpus == 0 )
+    if ( nr_cpus < 0 )
         return pyxc_error_to_exception(self->xc_handle);
 
     rc = xc_vcpu_getinfo(self->xc_handle, dom, vcpu, &info);
@@ -1923,7 +1923,7 @@ static PyObject *cpumap_to_cpulist(XcObject *self, xc_cpumap_t cpumap)
     int nr_cpus;
 
     nr_cpus = xc_get_max_cpus(self->xc_handle);
-    if ( nr_cpus == 0 )
+    if ( nr_cpus < 0 )
        return pyxc_error_to_exception(self->xc_handle);
 
     cpulist = PyList_New(0);
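[A sketch of what callers look like under the new convention (error
handling abbreviated): xc_get_max_cpus() now returns a positive count
or -1, never 0, so "< 0" is the only check a caller needs:]

#include <xenctrl.h>

/* Hedged sketch: compute the cpumap size in bytes, propagating the
 * failure exactly like xc_get_cpumap_size() does after this patch. */
static int cpumap_bytes(xc_interface *xch)
{
    int max_cpus = xc_get_max_cpus(xch);

    if ( max_cpus < 0 )
        return -1;                 /* propagate the failure */
    return (max_cpus + 7) / 8;     /* same formula as xc_get_cpumap_size() */
}

[On the libxl side the same pattern applies, except that the negative
value is a libxl error code (e.g. ERROR_FAIL) that can be returned
directly, as libxl_get_freecpus() now does.]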
Dario Faggioli
2013-Dec-02 18:27 UTC
[PATCH v5 05/17] libxc/libxl: allow to retrieve the number of online pCPUs
Do so by introducing xc_get_online_cpus() and libxl_get_online_cpus().

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v4:
 * introduced the xc call and called it from libxl, as suggested during
   review.
---
 tools/libxc/xc_misc.c     |   10 ++++++++++
 tools/libxc/xenctrl.h     |    3 +++
 tools/libxl/libxl.h       |    3 +++
 tools/libxl/libxl_utils.c |    7 +++++++
 4 files changed, 23 insertions(+)

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
index c771469..00cd0d8 100644
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -38,6 +38,16 @@ int xc_get_max_cpus(xc_interface *xch)
     return -1;
 }
 
+int xc_get_online_cpus(xc_interface *xch)
+{
+    xc_physinfo_t physinfo;
+
+    if ( !xc_physinfo(xch, &physinfo) )
+        return physinfo.nr_cpus;
+
+    return -1;
+}
+
 int xc_get_max_nodes(xc_interface *xch)
 {
     static int max_nodes = 0;
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
index 4ac6b8a..733ed03 100644
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -356,6 +356,9 @@ typedef uint8_t *xc_cpumap_t;
 /* return maximum number of cpus the hypervisor supports */
 int xc_get_max_cpus(xc_interface *xch);
 
+/* return the number of online cpus */
+int xc_get_online_cpus(xc_interface *xch);
+
 /* return array size for cpumap */
 int xc_get_cpumap_size(xc_interface *xch);
 
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
index bf311fe..a0d0908 100644
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -652,6 +652,9 @@ int libxl_domain_preserve(libxl_ctx *ctx, uint32_t domid, libxl_domain_create_in
 /* get max. number of cpus supported by hypervisor */
 int libxl_get_max_cpus(libxl_ctx *ctx);
 
+/* get the actual number of online cpus on the host */
+int libxl_get_online_cpus(libxl_ctx *ctx);
+
 /* get max. number of NUMA nodes supported by hypervisor */
 int libxl_get_max_nodes(libxl_ctx *ctx);
 
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
index 0833de2..c9cef66 100644
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -776,6 +776,13 @@ int libxl_get_max_cpus(libxl_ctx *ctx)
     return max_cpus < 0 ? ERROR_FAIL : max_cpus;
 }
 
+int libxl_get_online_cpus(libxl_ctx *ctx)
+{
+    int online_cpus = xc_get_online_cpus(ctx->xch);
+
+    return online_cpus < 0 ? ERROR_FAIL : online_cpus;
+}
+
 int libxl_get_max_nodes(libxl_ctx *ctx)
 {
     int max_nodes = xc_get_max_nodes(ctx->xch);
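[A hedged sketch of telling the two counts apart: the maximum number of
CPUs the hypervisor supports versus the CPUs actually online:]

#include <stdio.h>
#include <xenctrl.h>

/* Sketch: the maximum CPU number the hypervisor supports can exceed
 * the number of CPUs actually online (e.g. with offlined CPUs). */
static void report_cpus(xc_interface *xch)
{
    int max = xc_get_max_cpus(xch);
    int online = xc_get_online_cpus(xch);

    if ( max < 0 || online < 0 ) {
        fprintf(stderr, "failed to query CPU counts\n");
        return;
    }
    printf("%d CPUs online out of at most %d supported\n", online, max);
}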
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 06/17] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
in both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes from v1 (of this very series):
 * actually check for both "nodes:" and "node:", as the doc says;
 * use strcmp() (rather than strncmp()) when matching "all", to avoid
   returning success on any longer string that just begins with "all";
 * fix the handling (well, the rejection, actually) of "^all" and
   "^nodes:all";
 * make some string pointers const.

Changes from v2 (of the original series):
 * turned a 'return' into 'goto out', consistently with most of the
   exit patterns;
 * harmonized error handling: parse_range() now returns a libxl error
   code, as requested during review;
 * dealing with "all" moved inside update_cpumap_range(). It's tricky
   to move it into parse_range() (as requested during review), since
   we need the cpumap being modified handy when dealing with it.
   However, having it in update_cpumap_range() simplifies the code
   just as much;
 * explicitly check for junk after a valid value or range in
   parse_range(), as requested during review;
 * xl exits on a parsing failure, so there is no need to reset the
   cpumap to something sensible in vcpupin_parse(), as suggested
   during review.

Changes from v1 (of the original series):
 * code rearranged in order to look simpler to follow and understand,
   as requested during review;
 * improved docs in xl.cfg.pod.5, as requested during review;
 * strtoul() now returns into an unsigned long, and the case where it
   returns ULONG_MAX is now taken into account, as requested during
   review;
 * stuff like "all,^7" now works, as requested during review.
   Specifying just "^7" does not work either before or after this
   change;
 * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's')`) by
   introducing the STR_SKIP_PREFIX() macro, as requested during
   review.
---
 docs/man/xl.cfg.pod.5    |   20 ++++++
 tools/libxl/xl_cmdimpl.c |  153 +++++++++++++++++++++++++++++++++-------------
 2 files changed, 128 insertions(+), 45 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 7f107cb..70d6d9f 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
 
 =item "0-3,5,^1"
 
-To allow all the vcpus of the guest to run on cpus 0,2,3,5.
+To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
+this with "all" is possible, meaning "all,^7" results in all the vcpus
+of the guest running on all the cpus on the host except cpu 7.
+
+=item "nodes:0-3,^node:2"
+
+To allow all the vcpus of the guest to run on the cpus from NUMA nodes
+0,1,3 of the host. So, if cpus 0-3 belong to node 0, cpus 4-7 belong
+to node 2 and cpus 8-11 to node 3, the above would mean all the vcpus
+of the guest will run on cpus 0-3,8-11.
+
+Combining this notation with the one above is possible. For instance,
+"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
+on all the cpus of NUMA node 2, but not on cpu 6. Following the same
+example as above, that would be cpus 1,4,5,7.
+
+Combining this with "all" is also possible, meaning "all,^nodes:1"
+results in all the vcpus of the guest running on all the cpus on the
+host, except for the cpus belonging to the host NUMA node 1.
 
 =item ["2", "3"] (or [2, 3])
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 7fc049f..10ced5a 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -59,6 +59,11 @@
         }                                                               \
     })
 
+#define STR_HAS_PREFIX( a, b )  \
+    ( strncmp(a, b, strlen(b)) == 0 )
+#define STR_SKIP_PREFIX( a, b ) \
+    ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )
+
 
 int logfile = 2;
 
@@ -562,61 +567,121 @@ static void split_string_into_string_list(const char *str,
     free(s);
 }
 
-static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+static int parse_range(const char *str, unsigned long *a, unsigned long *b)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
-    char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
+    const char *nstr;
+    char *endptr;
 
-    if (!strcmp(cpu, "all")) {
-        libxl_bitmap_set_any(cpumap);
-        return 0;
+    *a = *b = strtoul(str, &endptr, 10);
+    if (endptr == str || *a == ULONG_MAX)
+        return ERROR_INVAL;
+
+    if (*endptr == '-') {
+        nstr = endptr + 1;
+
+        *b = strtoul(nstr, &endptr, 10);
+        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
+            return ERROR_INVAL;
+    }
+
+    /* Valid value or range so far, but we also don't want junk after that */
+    if (*endptr != '\0')
+        return ERROR_INVAL;
+
+    return 0;
+}
+
+/*
+ * Adds or removes a specific set of cpus (specified in str, either as
+ * single cpus or as entire NUMA nodes) to/from cpumap.
+ */
+static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
+{
+    unsigned long ida, idb;
+    libxl_bitmap node_cpumap;
+    bool is_not = false, is_nodes = false;
+    int rc = 0;
+
+    libxl_bitmap_init(&node_cpumap);
+
+    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
+    if (rc) {
+        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
+        goto out;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
-        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+    /* Are we adding or removing cpus/nodes? */
+    if (STR_SKIP_PREFIX(str, "^")) {
+        is_not = true;
     }
 
-    for (toka = strtok_r(cpu, ",", &saveptr); toka;
-         toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
-        if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
-            toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
-        if (endptr == toka) {
-            fprintf(stderr, "Error: Invalid argument.\n");
-            rc = EINVAL;
-            goto vcpp_out;
-        }
-        if (*endptr == '-') {
-            tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
-                fprintf(stderr, "Error: Invalid argument.\n");
-                rc = EINVAL;
-                goto vcpp_out;
+    /* Are we dealing with cpus or full nodes? */
+    if (STR_SKIP_PREFIX(str, "node:") || STR_SKIP_PREFIX(str, "nodes:")) {
+        is_nodes = true;
+    }
+
+    if (strcmp(str, "all") == 0) {
+        /* We do not accept "^all" or "^nodes:all" */
+        if (is_not) {
+            fprintf(stderr, "Can't combine \"^\" and \"all\".\n");
+            rc = ERROR_INVAL;
+        } else
+            libxl_bitmap_set_any(cpumap);
+        goto out;
+    }
+
+    rc = parse_range(str, &ida, &idb);
+    if (rc) {
+        fprintf(stderr, "Invalid pcpu range: %s.\n", str);
+        goto out;
+    }
+
+    /* Add or remove the specified cpus in the range */
+    while (ida <= idb) {
+        if (is_nodes) {
+            /* Add/Remove all the cpus of a NUMA node */
+            int i;
+
+            rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap);
+            if (rc) {
+                fprintf(stderr, "libxl_node_to_cpumap failed.\n");
+                goto out;
             }
+
+            /* Add/Remove all the cpus in the node cpumap */
+            libxl_for_each_set_bit(i, node_cpumap) {
+                is_not ? libxl_bitmap_reset(cpumap, i) :
+                         libxl_bitmap_set(cpumap, i);
+            }
+        } else {
+            /* Add/Remove this cpu */
+            is_not ? libxl_bitmap_reset(cpumap, ida) :
+                     libxl_bitmap_set(cpumap, ida);
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
+        ida++;
     }
 
-    /* Clear all the cpus from the removal list */
-    libxl_for_each_set_bit(i, exclude_cpumap) {
-        libxl_bitmap_reset(cpumap, i);
-    }
+ out:
+    libxl_bitmap_dispose(&node_cpumap);
+    return rc;
+}
 
-vcpp_out:
-    libxl_bitmap_dispose(&exclude_cpumap);
+/*
+ * Takes a string representing a set of cpus (specified either as
+ * single cpus or as entire NUMA nodes) and turns it into the
+ * corresponding libxl_bitmap (in cpumap).
+ */
+static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+{
+    char *ptr, *saveptr = NULL;
+    int rc = 0;
+
+    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
+         ptr = strtok_r(NULL, ",", &saveptr)) {
+        rc = update_cpumap_range(ptr, cpumap);
+        if (rc)
+            break;
+    }
 
     return rc;
 }
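[To see the parsing flow in isolation, here is a self-contained sketch
that reuses the patch's own parse_range() and prefix macros on a single
token; the real code applies this to each comma-separated token and
then updates the bitmap accordingly:]

#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdlib.h>

#define ERROR_INVAL (-1)   /* stand-in for libxl's error code */

#define STR_HAS_PREFIX( a, b )  \
    ( strncmp(a, b, strlen(b)) == 0 )
#define STR_SKIP_PREFIX( a, b ) \
    ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )

/* parse_range(), as in the patch: accepts "N" or "N-M", rejects junk. */
static int parse_range(const char *str, unsigned long *a, unsigned long *b)
{
    const char *nstr;
    char *endptr;

    *a = *b = strtoul(str, &endptr, 10);
    if (endptr == str || *a == ULONG_MAX)
        return ERROR_INVAL;
    if (*endptr == '-') {
        nstr = endptr + 1;
        *b = strtoul(nstr, &endptr, 10);
        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
            return ERROR_INVAL;
    }
    if (*endptr != '\0')    /* no junk after a valid value/range */
        return ERROR_INVAL;
    return 0;
}

int main(void)
{
    const char *str = "^nodes:1-2";
    unsigned long a, b;
    int is_not = STR_SKIP_PREFIX(str, "^");
    int is_nodes = STR_SKIP_PREFIX(str, "node:") ||
                   STR_SKIP_PREFIX(str, "nodes:");

    if (parse_range(str, &a, &b))
        return 1;
    printf("%s %s %lu-%lu\n", is_not ? "remove" : "add",
           is_nodes ? "nodes" : "cpus", a, b);  /* "remove nodes 1-2" */
    return 0;
}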
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 07/17] xl: implement and enable dryrun mode for `xl vcpu-pin'
As it can be useful to see whether the outcome of some complex vCPU
pinning bitmap specification looks as expected. This also allows for
the introduction of some automatic testing and verification of the
bitmap parsing code, as already happens for check-xl-disk-parse and
check-xl-vif-parse.

In particular, to make the above possible, this commit also changes
the implementation of the vcpu-pin command so that, instead of always
returning 0, it returns an error if the parsing fails.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Changes since v2 (of the original series):
 * fixed a typo in the changelog.
---
 tools/libxl/xl_cmdimpl.c  |   48 +++++++++++++++++++++++++++++------------
 tools/libxl/xl_cmdtable.c |    2 +-
 2 files changed, 36 insertions(+), 14 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 10ced5a..b756b70 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4582,40 +4582,62 @@ int main_vcpulist(int argc, char **argv)
     return 0;
 }
 
-static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
+static int vcpupin(uint32_t domid, const char *vcpu, char *cpu)
 {
     libxl_vcpuinfo *vcpuinfo;
     libxl_bitmap cpumap;
     uint32_t vcpuid;
     char *endptr;
-    int i, nb_vcpu;
+    int i, nb_cpu, nb_vcpu, rc = -1;
+
+    libxl_bitmap_init(&cpumap);
 
     vcpuid = strtoul(vcpu, &endptr, 10);
     if (vcpu == endptr) {
         if (strcmp(vcpu, "all")) {
             fprintf(stderr, "Error: Invalid argument.\n");
-            return;
+            goto out;
         }
         vcpuid = -1;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) {
-        goto vcpupin_out;
-    }
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0))
+        goto out;
 
     if (vcpupin_parse(cpu, &cpumap))
-        goto vcpupin_out1;
+        goto out;
+
+    if (dryrun_only) {
+        nb_cpu = libxl_get_online_cpus(ctx);
+        if (nb_cpu <= 0) {
+            fprintf(stderr, "libxl_get_online_cpus failed.\n");
+            goto out;
+        }
+
+        fprintf(stdout, "cpumap: ");
+        print_bitmap(cpumap.map, nb_cpu, stdout);
+        fprintf(stdout, "\n");
+
+        if (ferror(stdout) || fflush(stdout)) {
+            perror("stdout");
+            exit(-1);
+        }
+
+        rc = 0;
+        goto out;
+    }
 
     if (vcpuid != -1) {
         if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) {
             fprintf(stderr, "Could not set affinity for vcpu `%u'.\n", vcpuid);
+            goto out;
         }
     }
     else {
         if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) {
             fprintf(stderr, "libxl_list_vcpu failed.\n");
-            goto vcpupin_out1;
+            goto out;
         }
         for (i = 0; i < nb_vcpu; i++) {
             if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid,
@@ -4626,10 +4648,11 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
         }
         libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
     }
- vcpupin_out1:
+
+    rc = 0;
+ out:
     libxl_bitmap_dispose(&cpumap);
- vcpupin_out:
-    ;
+    return rc;
 }
 
 int main_vcpupin(int argc, char **argv)
@@ -4640,8 +4663,7 @@ int main_vcpupin(int argc, char **argv)
         /* No options */
     }
 
-    vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
-    return 0;
+    return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
 }
 
 static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host)
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 7709206..ebe0220 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -211,7 +211,7 @@ struct cmd_spec cmd_table[] = {
       "[Domain, ...]",
     },
     { "vcpu-pin",
-      &main_vcpupin, 0, 1,
+      &main_vcpupin, 1, 1,
       "Set which CPUs a VCPU can use",
       "<Domain> <VCPU|all> <CPUs|all>",
     },
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 08/17] xl: test script for the cpumap parser (for vCPU pinning)
This commit introduces "check-xl-vcpupin-parse", to help verify and
debug the (v)CPU bitmap parsing code in xl.

The script runs "xl -N vcpu-pin 0 all <some strings>" repeatedly, with
various input strings, and checks that the output is as expected.

This is what the script can do:

 # ./check-xl-vcpupin-parse -h
 usage: ./check-xl-vcpupin-parse [options]

 Tests various vcpu-pinning strings. If run without arguments acts
 as follows:
  - generates some test data and saves them in
    check-xl-vcpupin-parse.data;
  - tests all the generated configurations (reading them back from
    check-xl-vcpupin-parse.data).

 An example of a test vector file is provided in
 check-xl-vcpupin-parse.data-example.

 Options:
  -h         prints this message
  -r seed    uses seed for initializing the random number generator
             (default: the script PID)
  -s string  tries using string as a vcpu pinning configuration and
             reports whether that succeeds or not
  -o ofile   save the test data in ofile
             (default: check-xl-vcpupin-parse.data)
  -i ifile   read test data from ifile

An example test data file (generated on a host with 2 NUMA nodes and
16 CPUs) is provided in check-xl-vcpupin-parse.data-example.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v2 (of the original series):
 * killed the `sleep 1', as requested during review;
 * allow for passing a custom random seed, and report the actual
   random seed used, as requested during review;
 * allow for testing specific pinning configuration strings, as
   suggested during review;
 * store the test data in a file after generating them, and read them
   back from there for the actual testing, as suggested during review;
 * allow for reading the test data from an existing test file rather
   than always generating new ones.

Changes from v1 (of the original series):
 * this was not there in v1; adding it was requested during review.
---
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 ++++
 2 files changed, 347 insertions(+)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example

diff --git a/tools/libxl/check-xl-vcpupin-parse b/tools/libxl/check-xl-vcpupin-parse
new file mode 100755
index 0000000..21f8421
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse
@@ -0,0 +1,294 @@
+#!/bin/bash
+
+set -e
+
+if [ -x ./xl ] ; then
+    export LD_LIBRARY_PATH=.:../libxc:../xenstore:
+    XL=./xl
+else
+    XL=xl
+fi
+
+fprefix=tmp.check-xl-vcpupin-parse
+outfile=check-xl-vcpupin-parse.data
+
+usage () {
+cat <<END
+usage: $0 [options]
+
+Tests various vcpu-pinning strings. If run without arguments acts
+as follows:
+ - generates some test data and saves them in $outfile;
+ - tests all the generated configurations (reading them back from
+   $outfile).
+
+An example of a test vector file is provided in ${outfile}-example.
+
+Options:
+ -h         prints this message
+ -r seed    uses seed for initializing the random number generator
+            (default: the script PID)
+ -s string  tries using string as a vcpu pinning configuration and
+            reports whether that succeeds or not
+ -o ofile   save the test data in ofile (default: $outfile)
+ -i ifile   read test data from ifile
+END
+}
+
+expected () {
+    cat >$fprefix.expected
+}
+
+# by default, re-seed with our PID
+seed=$$
+failures=0
+
+# Execute one test and check the result against the provided
+# rc value and output
+one () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" </dev/null >$fprefix.actual 2>/dev/null
+    actual_rc=$?
+    if [ $actual_rc != $expected_rc ]; then
+        diff -u $fprefix.expected $fprefix.actual
+        echo >&2 "test case \`$*' failed ($actual_rc)"
+        failures=$(( $failures + 1 ))
+    fi
+    set -e
+}
+
+# Write an entry in the test vector file. Format is as follows:
+#  test-string*expected-rc*expected-output
+write () {
+    printf "$1*$2*$3\n" >> $outfile
+}
+
+complete () {
+    if [ "$failures" = 0 ]; then
+        echo all ok.; exit 0
+    else
+        echo "$failures tests failed."; exit 1
+    fi
+}
+
+# Test a specific pinning string
+string () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" &> /dev/null
+    actual_rc=$?
+    set -e
+
+    if [ $actual_rc != $expected_rc ]; then
+        echo >&2 "test case \`$*' failed ($actual_rc)"
+    else
+        echo >&2 "test case \`$*' succeeded"
+    fi
+
+    exit 0
+}
+
+# Read a test vector file (provided as $1) line by line and
+# test all the entries it contains
+run ()
+{
+    while read line
+    do
+        if [ ${line:0:1} != '#' ]; then
+            test_string="`echo $line | cut -f1 -d'*'`"
+            exp_rc="`echo $line | cut -f2 -d'*'`"
+            exp_output="`echo $line | cut -f3 -d'*'`"
+
+            expected <<END
+$exp_output
+END
+            one $exp_rc "$test_string"
+        fi
+    done < $1
+
+    complete
+
+    exit 0
+}
+
+while getopts "hr:s:o:i:" option
+do
+    case $option in
+    h)
+        usage
+        exit 0
+        ;;
+    r)
+        seed=$OPTARG
+        ;;
+    s)
+        string 0 "$OPTARG"
+        ;;
+    o)
+        outfile=$OPTARG
+        ;;
+    i)
+        run $OPTARG
+        ;;
+    esac
+done
+
+#---------- test data ----------
+#
+nr_cpus=`xl info | grep nr_cpus | cut -f2 -d':'`
+nr_nodes=`xl info | grep nr_nodes | cut -f2 -d':'`
+nr_cpus_per_node=`xl info -n | sed '/cpu:/,/numa_info/!d' | head -n -1 | \
+    awk '{print $4}' | uniq -c | tail -1 | awk '{print $1}'`
+cat >$outfile <<END
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: $nr_cpus CPUs, $nr_nodes nodes, $nr_cpus_per_node CPUs per node.
+#
+# seed used for random number generation: seed=${seed}.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+END
+
+# Re-seed the random number generator
+RANDOM=$seed
+
+echo "# Testing a wrong configuration" >> $outfile
+write foo 255 ""
+
+echo "# Testing the 'all' syntax" >> $outfile
+write "all" 0 "cpumap: all"
+write "nodes:all" 0 "cpumap: all"
+write "all,nodes:all" 0 "cpumap: all"
+write "all,^nodes:0,all" 0 "cpumap: all"
+
+echo "# Testing the empty cpumap case" >> $outfile
+write "^0" 0 "cpumap: none"
+
+echo "# A few attempts of pinning to just one random cpu" >> $outfile
+if [ $nr_cpus -gt 1 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        write "$cpu" 0 "cpumap: $cpu"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random cpu" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        if [ $cpu -eq 0 ]; then
+            expected_range="1-$((nr_cpus - 1))"
+        elif [ $cpu -eq 1 ]; then
+            expected_range="0,2-$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 2)) ]; then
+            expected_range="0-$((cpu - 1)),$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 1)) ]; then
+            expected_range="0-$((nr_cpus - 2))"
+        else
+            expected_range="0-$((cpu - 1)),$((cpu + 1))-$((nr_cpus - 1))"
+        fi
+        write "all,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of cpus" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpua=$(($RANDOM % nr_cpus))
+        range=$((nr_cpus - cpua))
+        cpub=$(($RANDOM % range))
+        cpubb=$((cpua + cpub))
+        if [ $cpua -eq $cpubb ]; then
+            expected_range="$cpua"
+        else
+            expected_range="$cpua-$cpubb"
+        fi
+        write "$expected_range" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to just one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        write "nodes:$node" 0 "cpumap: $((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1)-1))"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $node -eq 0 ]; then
+            expected_range="$nr_cpus_per_node-$((nr_cpus - 1))"
+        elif [ $node -eq $((nr_nodes - 1)) ]; then
+            expected_range="0-$((nr_cpus - nr_cpus_per_node - 1))"
+        else
+            expected_range="0-$((nr_cpus_per_node*node-1)),$((nr_cpus_per_node*(node+1)))-$nr_cpus"
+        fi
+        write "all,^nodes:$node" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of nodes" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        nodea=$(($RANDOM % nr_nodes))
+        range=$((nr_nodes - nodea))
+        nodeb=$(($RANDOM % range))
+        nodebb=$((nodea + nodeb))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $nodea -eq 0 ] && [ $nodebb -eq $((nr_nodes - 1)) ]; then
+            expected_range="all"
+        else
+            expected_range="$((nr_cpus_per_node*nodea))-$((nr_cpus_per_node*(nodebb+1) - 1))"
+        fi
+        write "nodes:$nodea-$nodebb" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a node but excluding one random cpu" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        cpu=$(($RANDOM % nr_cpus_per_node + nr_cpus_per_node*node))
+        if [ $cpu -eq $((nr_cpus_per_node*node)) ]; then
+            expected_range="$((nr_cpus_per_node*node + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*node + 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node)),$((nr_cpus_per_node*node + 2))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 2)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 3)),$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 2))"
+        else
+            expected_range="$((nr_cpus_per_node*node))-$((cpu - 1)),$((cpu + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        fi
+        write "nodes:$node,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+run $outfile
diff --git a/tools/libxl/check-xl-vcpupin-parse.data-example b/tools/libxl/check-xl-vcpupin-parse.data-example
new file mode 100644
index 0000000..4bbd5de
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse.data-example
@@ -0,0 +1,53 @@
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: 16 CPUs, 2 nodes, 8 CPUs per node.
+#
+# seed used for random number generation: seed=13328.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+# Testing a wrong configuration
+foo*255*
+# Testing the 'all' syntax
+all*0*cpumap: all
+nodes:all*0*cpumap: all
+all,nodes:all*0*cpumap: all
+all,^nodes:0,all*0*cpumap: all
+# Testing the empty cpumap case
+^0*0*cpumap: none
+# A few attempts of pinning to just one random cpu
+0*0*cpumap: 0
+9*0*cpumap: 9
+6*0*cpumap: 6
+0*0*cpumap: 0
+# A few attempts of pinning to all but one random cpu
+all,^12*0*cpumap: 0-11,13-15
+all,^6*0*cpumap: 0-5,7-15
+all,^3*0*cpumap: 0-2,4-15
+all,^7*0*cpumap: 0-6,8-15
+# A few attempts of pinning to a random range of cpus
+13-15*0*cpumap: 13-15
+7*0*cpumap: 7
+3-5*0*cpumap: 3-5
+8-11*0*cpumap: 8-11
+# A few attempts of pinning to just one random node
+nodes:1*0*cpumap: 8-15
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+# A few attempts of pinning to all but one random node
+all,^nodes:0*0*cpumap: 8-15
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:0*0*cpumap: 8-15
+# A few attempts of pinning to a random range of nodes
+nodes:1-1*0*cpumap: 8-15
+nodes:1-1*0*cpumap: 8-15
+nodes:0-1*0*cpumap: all
+nodes:0-0*0*cpumap: 0-7
+# A few attempts of pinning to a node but excluding one random cpu
+nodes:1,^8*0*cpumap: 9-15
+nodes:0,^6*0*cpumap: 0-5,7
+nodes:1,^9*0*cpumap: 8,10-15
+nodes:0,^5*0*cpumap: 0-4,6-7
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 09/17] xen: sched: rename v->cpu_affinity into v->cpu_hard_affinity
This is in order to distinguish it from the cpu_soft_affinity that will
be introduced in a later commit ("xen: sched: introduce soft-affinity
and use it instead of d->node-affinity").

This patch does not imply any functional change; it is basically the
result of something like the following:

 s/cpu_affinity/cpu_hard_affinity/g
 s/cpu_affinity_tmp/cpu_hard_affinity_tmp/g
 s/cpu_affinity_saved/cpu_hard_affinity_saved/g

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v2:
 * the patch has been moved one step up in the series.
---
 xen/arch/x86/traps.c      |   11 ++++++-----
 xen/common/domain.c       |   22 +++++++++++-----------
 xen/common/domctl.c       |    2 +-
 xen/common/keyhandler.c   |    2 +-
 xen/common/sched_credit.c |   12 ++++++------
 xen/common/sched_sedf.c   |    2 +-
 xen/common/schedule.c     |   21 +++++++++++----------
 xen/common/wait.c         |    4 ++--
 xen/include/xen/sched.h   |    8 ++++----
 9 files changed, 43 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index d8b3eac..157031e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3093,7 +3093,8 @@ static void nmi_mce_softirq(void)
 
     /* Set the tmp value unconditionally, so that
      * the check in the iret hypercall works. */
-    cpumask_copy(st->vcpu->cpu_affinity_tmp, st->vcpu->cpu_affinity);
+    cpumask_copy(st->vcpu->cpu_hard_affinity_tmp,
+                 st->vcpu->cpu_hard_affinity);
 
     if ((cpu != st->processor)
        || (st->processor != st->vcpu->processor))
@@ -3128,11 +3129,11 @@ void async_exception_cleanup(struct vcpu *curr)
         return;
 
     /* Restore affinity. */
-    if ( !cpumask_empty(curr->cpu_affinity_tmp) &&
-         !cpumask_equal(curr->cpu_affinity_tmp, curr->cpu_affinity) )
+    if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) &&
+         !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) )
     {
-        vcpu_set_affinity(curr, curr->cpu_affinity_tmp);
-        cpumask_clear(curr->cpu_affinity_tmp);
+        vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp);
+        cpumask_clear(curr->cpu_hard_affinity_tmp);
     }
 
     if ( !(curr->async_exception_mask & (curr->async_exception_mask - 1)) )
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 2cbc489..d8116c7 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -125,9 +125,9 @@ struct vcpu *alloc_vcpu(
 
     tasklet_init(&v->continue_hypercall_tasklet, NULL, 0);
 
-    if ( !zalloc_cpumask_var(&v->cpu_affinity) ||
-         !zalloc_cpumask_var(&v->cpu_affinity_tmp) ||
-         !zalloc_cpumask_var(&v->cpu_affinity_saved) ||
+    if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) ||
+         !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) ||
+         !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) ||
          !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) )
         goto fail_free;
 
@@ -156,9 +156,9 @@ struct vcpu *alloc_vcpu(
  fail_wq:
         destroy_waitqueue_vcpu(v);
  fail_free:
-        free_cpumask_var(v->cpu_affinity);
-        free_cpumask_var(v->cpu_affinity_tmp);
-        free_cpumask_var(v->cpu_affinity_saved);
+        free_cpumask_var(v->cpu_hard_affinity);
+        free_cpumask_var(v->cpu_hard_affinity_tmp);
+        free_cpumask_var(v->cpu_hard_affinity_saved);
         free_cpumask_var(v->vcpu_dirty_cpumask);
         free_vcpu_struct(v);
         return NULL;
@@ -371,7 +371,7 @@ void domain_update_node_affinity(struct domain *d)
 
     for_each_vcpu ( d, v )
     {
-        cpumask_and(online_affinity, v->cpu_affinity, online);
+        cpumask_and(online_affinity, v->cpu_hard_affinity, online);
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
@@ -734,9 +734,9 @@ static void complete_domain_destroy(struct rcu_head *head)
     for ( i = d->max_vcpus - 1; i >= 0; i-- )
         if ( (v = d->vcpu[i]) != NULL )
         {
-            free_cpumask_var(v->cpu_affinity);
-            free_cpumask_var(v->cpu_affinity_tmp);
-            free_cpumask_var(v->cpu_affinity_saved);
+            free_cpumask_var(v->cpu_hard_affinity);
+            free_cpumask_var(v->cpu_hard_affinity_tmp);
+            free_cpumask_var(v->cpu_hard_affinity_saved);
             free_cpumask_var(v->vcpu_dirty_cpumask);
             free_vcpu_struct(v);
         }
@@ -875,7 +875,7 @@ int vcpu_reset(struct vcpu *v)
     v->async_exception_mask = 0;
     memset(v->async_exception_state, 0, sizeof(v->async_exception_state));
 #endif
-    cpumask_clear(v->cpu_affinity_tmp);
+    cpumask_clear(v->cpu_hard_affinity_tmp);
     clear_bit(_VPF_blocked, &v->pause_flags);
     clear_bit(_VPF_in_reset, &v->pause_flags);
 
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 904d27b..5e0ac5c 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -629,7 +629,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
         else
         {
             ret = cpumask_to_xenctl_bitmap(
-                &op->u.vcpuaffinity.cpumap, v->cpu_affinity);
+                &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
         }
     }
     break;
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
index 8e4b3f8..c11f577 100644
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -296,7 +296,7 @@ static void dump_domains(unsigned char key)
                    !vcpu_event_delivery_is_enabled(v));
             cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask);
             printk("dirty_cpus=%s ", tmpstr);
-            cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity);
+            cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_hard_affinity);
             printk("cpu_affinity=%s\n", tmpstr);
             printk("    pause_count=%d pause_flags=%lx\n",
                    atomic_read(&v->pause_count), v->pause_flags);
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
index db5512e..c6a2560 100644
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -332,13 +332,13 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
     if ( step == CSCHED_BALANCE_NODE_AFFINITY )
     {
         cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask,
-                    vc->cpu_affinity);
+                    vc->cpu_hard_affinity);
 
         if ( unlikely(cpumask_empty(mask)) )
-            cpumask_copy(mask, vc->cpu_affinity);
+            cpumask_copy(mask, vc->cpu_hard_affinity);
     }
     else /* step == CSCHED_BALANCE_CPU_AFFINITY */
-        cpumask_copy(mask, vc->cpu_affinity);
+        cpumask_copy(mask, vc->cpu_hard_affinity);
 }
 
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
@@ -407,7 +407,7 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 
             if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
                  && !__vcpu_has_node_affinity(new->vcpu,
-                                              new->vcpu->cpu_affinity) )
+                                              new->vcpu->cpu_hard_affinity) )
                 continue;
 
             /* Are there idlers suitable for new (for this balance step)? */
@@ -642,7 +642,7 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
 
     /* Store in cpus the mask of online cpus on which the domain can run */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
-    cpumask_and(&cpus, vc->cpu_affinity, online);
+    cpumask_and(&cpus, vc->cpu_hard_affinity, online);
 
     for_each_csched_balance_step( balance_step )
     {
@@ -1498,7 +1498,7 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step)
              * or counter.
              */
            if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
-                 && !__vcpu_has_node_affinity(vc, vc->cpu_affinity) )
+                 && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) )
                 continue;
 
             csched_balance_cpumask(vc, balance_step, csched_balance_mask);
diff --git a/xen/common/sched_sedf.c b/xen/common/sched_sedf.c
index 7c24171..c219aed 100644
--- a/xen/common/sched_sedf.c
+++ b/xen/common/sched_sedf.c
@@ -396,7 +396,7 @@ static int sedf_pick_cpu(const struct scheduler *ops, struct vcpu *v)
     cpumask_t *online;
 
     online = cpupool_scheduler_cpumask(v->domain->cpupool);
-    cpumask_and(&online_affinity, v->cpu_affinity, online);
+    cpumask_and(&online_affinity, v->cpu_hard_affinity, online);
     return cpumask_cycle(v->vcpu_id % cpumask_weight(&online_affinity) - 1,
                          &online_affinity);
 }
diff --git a/xen/common/schedule.c b/xen/common/schedule.c
index 0f45f07..c4236c5 100644
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -194,9 +194,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor)
      */
     v->processor = processor;
     if ( is_idle_domain(d) || d->is_pinned )
-        cpumask_copy(v->cpu_affinity, cpumask_of(processor));
+        cpumask_copy(v->cpu_hard_affinity, cpumask_of(processor));
     else
-        cpumask_setall(v->cpu_affinity);
+        cpumask_setall(v->cpu_hard_affinity);
 
     /* Initialise the per-vcpu timers. */
     init_timer(&v->periodic_timer, vcpu_periodic_timer_fn,
@@ -285,7 +285,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c)
         migrate_timer(&v->singleshot_timer, new_p);
         migrate_timer(&v->poll_timer, new_p);
 
-        cpumask_setall(v->cpu_affinity);
+        cpumask_setall(v->cpu_hard_affinity);
 
         lock = vcpu_schedule_lock_irq(v);
         v->processor = new_p;
@@ -457,7 +457,7 @@ static void vcpu_migrate(struct vcpu *v)
          */
         if ( pick_called &&
             (new_lock == per_cpu(schedule_data, new_cpu).schedule_lock) &&
-             cpumask_test_cpu(new_cpu, v->cpu_affinity) &&
+             cpumask_test_cpu(new_cpu, v->cpu_hard_affinity) &&
              cpumask_test_cpu(new_cpu, v->domain->cpupool->cpu_valid) )
             break;
 
@@ -561,7 +561,7 @@ void restore_vcpu_affinity(struct domain *d)
         {
             printk(XENLOG_DEBUG "Restoring affinity for d%dv%d\n",
                    d->domain_id, v->vcpu_id);
-            cpumask_copy(v->cpu_affinity, v->cpu_affinity_saved);
+            cpumask_copy(v->cpu_hard_affinity, v->cpu_hard_affinity_saved);
             v->affinity_broken = 0;
         }
 
@@ -604,20 +604,21 @@ int cpu_disable_scheduler(unsigned int cpu)
             unsigned long flags;
             spinlock_t *lock = vcpu_schedule_lock_irqsave(v, &flags);
 
-            cpumask_and(&online_affinity, v->cpu_affinity, c->cpu_valid);
+            cpumask_and(&online_affinity, v->cpu_hard_affinity, c->cpu_valid);
             if ( cpumask_empty(&online_affinity) &&
-                 cpumask_test_cpu(cpu, v->cpu_affinity) )
+                 cpumask_test_cpu(cpu, v->cpu_hard_affinity) )
             {
                 printk(XENLOG_DEBUG "Breaking affinity for d%dv%d\n",
                        d->domain_id, v->vcpu_id);
 
                 if (system_state == SYS_STATE_suspend)
                 {
-                    cpumask_copy(v->cpu_affinity_saved, v->cpu_affinity);
+                    cpumask_copy(v->cpu_hard_affinity_saved,
+                                 v->cpu_hard_affinity);
                     v->affinity_broken = 1;
                 }
 
-                cpumask_setall(v->cpu_affinity);
+                cpumask_setall(v->cpu_hard_affinity);
             }
 
             if ( v->processor == cpu )
@@ -665,7 +666,7 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity)
 
     lock = vcpu_schedule_lock_irq(v);
 
-    cpumask_copy(v->cpu_affinity, affinity);
+    cpumask_copy(v->cpu_hard_affinity, affinity);
 
     /* Always ask the scheduler to re-evaluate placement
      * when changing the affinity */
diff --git a/xen/common/wait.c b/xen/common/wait.c
index 3c9366c..3f6ff41 100644
--- a/xen/common/wait.c
+++ b/xen/common/wait.c
@@ -134,7 +134,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv)
 
     /* Save current VCPU affinity; force wakeup on *this* CPU only. */
     wqv->wakeup_cpu = smp_processor_id();
-    cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity);
+    cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity);
     if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) )
     {
         gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n");
@@ -183,7 +183,7 @@ void check_wakeup_from_wait(void)
     {
         /* Re-set VCPU affinity and re-enter the scheduler. */
         struct vcpu *curr = current;
-        cpumask_copy(&wqv->saved_affinity, curr->cpu_affinity);
+        cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity);
         if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) )
         {
             gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n");
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index cbdf377..40e5927 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -192,11 +192,11 @@ struct vcpu
     spinlock_t       virq_lock;
 
     /* Bitmask of CPUs on which this VCPU may run. */
-    cpumask_var_t    cpu_affinity;
+    cpumask_var_t    cpu_hard_affinity;
     /* Used to change affinity temporarily. */
-    cpumask_var_t    cpu_affinity_tmp;
+    cpumask_var_t    cpu_hard_affinity_tmp;
     /* Used to restore affinity across S3. */
-    cpumask_var_t    cpu_affinity_saved;
+    cpumask_var_t    cpu_hard_affinity_saved;
 
     /* Bitmask of CPUs which are holding onto this VCPU's state. */
     cpumask_var_t    vcpu_dirty_cpumask;
@@ -792,7 +792,7 @@ void watchdog_domain_destroy(struct domain *d);
 #define has_hvm_container_domain(d) ((d)->guest_type != guest_type_pv)
 #define has_hvm_container_vcpu(v)   (has_hvm_container_domain((v)->domain))
 #define is_pinned_vcpu(v) ((v)->domain->is_pinned || \
                           cpumask_weight((v)->cpu_hard_affinity) == 1)
#ifdef HAS_PASSTHROUGH
 #define need_iommu(d)    ((d)->need_iommu)
 #else
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 10/17] xen: sched: introduce soft-affinity and use it instead of d->node-affinity
Before this change, each vcpu had its own vcpu-affinity (in
v->cpu_affinity), representing the set of pcpus where the vcpu is
allowed to run. Since NUMA-aware scheduling was introduced, the
(credit1 only, for now) scheduler also tries as much as it can to run
all the vcpus of a domain on one of the nodes that constitutes the
domain''s node-affinity.

The idea here is making the mechanism more general by:

 * allowing for this ''preference'' for some pcpus/nodes to be expressed
   on a per-vcpu basis, rather than for the domain as a whole. That is
   to say, each vcpu should have its own set of preferred pcpus/nodes,
   rather than it being the very same for all the vcpus of the domain;

 * generalizing the idea of ''preferred pcpus'' beyond NUMA awareness
   and support. That is to say, independently of whether or not it is
   (mostly) useful on NUMA systems, it should be possible to specify,
   for each vcpu, a set of pcpus where it prefers to run (in addition,
   and possibly unrelated to, the set of pcpus where it is allowed to
   run).

We will be calling this set of *preferred* pcpus the vcpu''s soft
affinity, and this change introduces it and starts using it for
scheduling, replacing the indirect use of the domain''s NUMA
node-affinity. This is more general, as soft affinity does not have to
be related to NUMA. Nevertheless, it allows achieving the same results
as NUMA-aware scheduling, just by making the soft affinity equal to the
domain''s node affinity, for all the vCPUs (e.g., from the toolstack).

This also means renaming most of the NUMA-aware scheduling related
functions, in credit1, to something more generic, hinting toward the
concept of soft affinity rather than directly to NUMA awareness.

As a side effect, this simplifies the code quite a bit. In fact, prior
to this change, we needed to cache the translation of d->node_affinity
(which is a nodemask_t) to a cpumask_t, since that is what scheduling
decisions require (we used to keep it in node_affinity_cpumask). This,
and all the complicated logic required to keep it updated, is not
necessary any longer.

The high level description of NUMA placement and scheduling in
docs/misc/xl-numa-placement.markdown is being updated too, to match
the new architecture.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v2:
 * this patch folds patches 6 ("xen: sched: make space for
   cpu_soft_affinity") and 10 ("xen: sched: use soft-affinity instead
   of domain''s node-affinity"), as suggested during review.
   ''Reviewed-by'' from George is there since both patch 6 and 10 had
   it, and I didn''t do anything other than squash them.

Changes from v1:
 * in v1, "7/12 xen: numa-sched: use per-vcpu node-affinity for actual
   scheduling" was doing something very similar to this patch.
---
 docs/misc/xl-numa-placement.markdown |  148 ++++++++++++++++++++++-----------
 xen/common/domain.c                  |    5 +
 xen/common/keyhandler.c              |    2 
 xen/common/sched_credit.c            |  153 +++++++++++++---------------------
 xen/common/schedule.c                |    3 +
 xen/include/xen/sched.h              |    3 +
 6 files changed, 168 insertions(+), 146 deletions(-)

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
index caa3fec..b1ed361 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -12,13 +12,6 @@ is quite more complex and slow.
 On these machines, a NUMA node is usually defined as a set of processor
 cores (typically a physical CPU package) and the memory directly attached
 to the set of cores.
 
-The Xen hypervisor deals with NUMA machines by assigning to each domain
-a "node affinity", i.e., a set of NUMA nodes of the host from which they
-get their memory allocated. Also, even if the node affinity of a domain
-is allowed to change on-line, it is very important to "place" the domain
-correctly when it is fist created, as the most of its memory is allocated
-at that time and can not (for now) be moved easily.
-
 NUMA awareness becomes very important as soon as many domains start
 running memory-intensive workloads on a shared host. In fact, the cost
 of accessing non node-local memory locations is very high, and the
@@ -27,14 +20,37 @@ performance degradation is likely to be noticeable.
 For more information, have a look at the [Xen NUMA Introduction][numa_intro]
 page on the Wiki.
 
+## Xen and NUMA machines: the concept of _node-affinity_ ##
+
+The Xen hypervisor deals with NUMA machines through the concept of
+_node-affinity_. The node-affinity of a domain is the set of NUMA nodes
+of the host where the memory for the domain is being allocated (mostly,
+at domain creation time). This is, at least in principle, different and
+unrelated to the vCPU (hard and soft, see below) scheduling affinity,
+which instead is the set of pCPUs where the vCPU is allowed (or prefers)
+to run.
+
+Of course, despite the fact that they belong to and affect different
+subsystems, the domain node-affinity and the vCPUs'' affinity are not
+completely independent.
+In fact, if the domain node-affinity is not explicitly specified by the
+user, via the proper libxl calls or xl config item, it will be computed
+based on the vCPUs'' scheduling affinity.
+
+Notice that, even if the node affinity of a domain may change on-line,
+it is very important to "place" the domain correctly when it is first
+created, as most of its memory is allocated at that time and can
+not (for now) be moved easily.
+
 ### Placing via pinning and cpupools ###
 
-The simplest way of placing a domain on a NUMA node is statically pinning
-the domain''s vCPUs to the pCPUs of the node. This goes under the name of
-CPU affinity and can be set through the "cpus=" option in the config file
-(more about this below). Another option is to pool together the pCPUs
-spanning the node and put the domain in such a cpupool with the "pool="
-config option (as documented in our [Wiki][cpupools_howto]).
+The simplest way of placing a domain on a NUMA node is setting the hard
+scheduling affinity of the domain''s vCPUs to the pCPUs of the node. This
+also goes under the name of vCPU pinning, and can be done through the
+"cpus=" option in the config file (more about this below). Another option
+is to pool together the pCPUs spanning the node and put the domain in
+such a _cpupool_ with the "pool=" config option (as documented in our
+[Wiki][cpupools_howto]).
 
 In both the above cases, the domain will not be able to execute outside
 the specified set of pCPUs for any reasons, even if all those pCPUs are
@@ -45,24 +61,45 @@ may come at he cost of some load imbalances.
 
 ### NUMA aware scheduling ###
 
-If the credit scheduler is in use, the concept of node affinity defined
-above does not only apply to memory. In fact, starting from Xen 4.3, the
-scheduler always tries to run the domain''s vCPUs on one of the nodes in
-its node affinity. Only if that turns out to be impossible, it will just
-pick any free pCPU.
-
-This is, therefore, something more flexible than CPU affinity, as a domain
-can still run everywhere, it just prefers some nodes rather than others.
-Locality of access is less guaranteed than in the pinning case, but that
-comes along with better chances to exploit all the host resources (e.g.,
-the pCPUs).
-
-In fact, if all the pCPUs in a domain''s node affinity are busy, it is
-possible for the domain to run outside of there, but it is very likely that
-slower execution (due to remote memory accesses) is still better than no
-execution at all, as it would happen with pinning. For this reason, NUMA
-aware scheduling has the potential of bringing substantial performances
-benefits, although this will depend on the workload.
+If using the credit1 scheduler, and starting from Xen 4.3, the scheduler
+itself always tries to run the domain''s vCPUs on one of the nodes in
+its node-affinity. Only if that turns out to be impossible will it just
+pick any free pCPU. Locality of access is less guaranteed than in the
+pinning case, but that comes along with better chances to exploit all
+the host resources (e.g., the pCPUs).
+
+Starting from Xen 4.4, credit1 supports two forms of affinity: hard and
+soft, both on a per-vCPU basis. This means each vCPU can have its own
+soft affinity, stating on which pCPUs such a vCPU prefers to execute.
+This is less strict than what (also starting from 4.4) is called hard
+affinity, as the vCPU can potentially run everywhere, it just prefers
+some pCPUs rather than others.
+In Xen 4.4, therefore, NUMA-aware scheduling is achieved by matching the
+soft affinity of the vCPUs of a domain with its node-affinity.
+
+In fact, as it was for 4.3, if all the pCPUs in a vCPU''s soft affinity
+are busy, it is possible for the domain to run outside of there. The
+idea is that slower execution (due to remote memory accesses) is still
+better than no execution at all (as it would happen with pinning). For
+this reason, NUMA aware scheduling has the potential of bringing
+substantial performance benefits, although this will depend on the
+workload.
+
+Notice that, for each vCPU, the following three scenarios are possible:
+
+ * a vCPU *is pinned* to some pCPUs and *does not have* any soft affinity.
+   In this case, the vCPU is always scheduled on one of the pCPUs to which
+   it is pinned, without any specific preference among them.
+ * a vCPU *has* its own soft affinity and *is not* pinned to any particular
+   pCPU. In this case, the vCPU can run on every pCPU. Nevertheless, the
+   scheduler will try to have it running on one of the pCPUs in its soft
+   affinity;
+ * a vCPU *has* its own vCPU soft affinity and *is also* pinned to some
+   pCPUs. In this case, the vCPU is always scheduled on one of the pCPUs
+   onto which it is pinned, with, among them, a preference for the ones
+   that also form its soft affinity. In case pinning and soft affinity
+   form two disjoint sets of pCPUs, pinning "wins", and the soft affinity
+   is just ignored.
 
 ## Guest placement in xl ##
 
@@ -71,25 +108,23 @@ both manual or automatic placement of them across the host''s NUMA nodes.
 
 Note that xm/xend does a very similar thing, the only differences being
 the details of the heuristics adopted for automatic placement (see below),
-and the lack of support (in both xm/xend and the Xen versions where that\
+and the lack of support (in both xm/xend and the Xen versions where that
 was the default toolstack) for NUMA aware scheduling.
 ### Placing the guest manually ###
 
 Thanks to the "cpus=" option, it is possible to specify where a domain
 should be created and scheduled on, directly in its config file. This
-affects NUMA placement and memory accesses as the hypervisor constructs
-the node affinity of a VM basing right on its CPU affinity when it is
-created.
+affects NUMA placement and memory accesses as, in this case, the
+hypervisor constructs the node-affinity of a VM based directly on its
+vCPU pinning when it is created.
 
 This is very simple and effective, but requires the user/system
-administrator to explicitly specify affinities for each and every domain,
+administrator to explicitly specify the pinning for each and every domain,
 or Xen won''t be able to guarantee the locality for their memory
 accesses.
 
-Notice that this also pins the domain''s vCPUs to the specified set of
-pCPUs, so it not only sets the domain''s node affinity (its memory will
-come from the nodes to which the pCPUs belong), but at the same time
-forces the vCPUs of the domain to be scheduled on those same pCPUs.
+That, of course, also means the vCPUs of the domain will only be able to
+execute on those same pCPUs.
 
 ### Placing the guest automatically ###
 
@@ -97,7 +132,9 @@ If no "cpus=" option is specified in the config file, libxl tries
 to figure out on its own on which node(s) the domain could fit best.
 If it finds one (some), the domain''s node affinity get set to there,
 and both memory allocations and NUMA aware scheduling (for the credit
-scheduler and starting from Xen 4.3) will comply with it.
+scheduler and starting from Xen 4.3) will comply with it. Starting from
+Xen 4.4, this also means that the mask resulting from this "fitting"
+procedure will become the soft affinity of all the vCPUs of the domain.
 
 It is worthwhile noting that optimally fitting a set of VMs on the NUMA
 nodes of an host is an incarnation of the Bin Packing Problem. In fact,
@@ -142,34 +179,43 @@ any placement from happening:
 
     libxl_defbool_set(&domain_build_info->numa_placement, false);
 
-Also, if `numa_placement` is set to `true`, the domain must not
-have any CPU affinity (i.e., `domain_build_info->cpumap` must
-have all its bits set, as it is by default), or domain creation
-will fail returning `ERROR_INVAL`.
+Also, if `numa_placement` is set to `true`, the domain''s vCPUs must
+not be pinned (i.e., `domain_build_info->cpumap` must have all its
+bits set, as it is by default), or domain creation will fail with
+`ERROR_INVAL`.
 
 Starting from Xen 4.3, in case automatic placement happens (and is
-successful), it will affect the domain''s node affinity and _not_ its
-CPU affinity. Namely, the domain''s vCPUs will not be pinned to any
+successful), it will affect the domain''s node-affinity and _not_ its
+vCPU pinning. Namely, the domain''s vCPUs will not be pinned to any
 pCPU on the host, but the memory from the domain will come from the
 selected node(s) and the NUMA aware scheduling (if the credit scheduler
-is in use) will try to keep the domain there as much as possible.
+is in use) will try to keep the domain''s vCPUs there as much as possible.
 
 Besides than that, looking and/or tweaking the placement algorithm
 search "Automatic NUMA placement" in libxl\_internal.h.
 
 Note this may change in future versions of Xen/libxl.
 
+## Xen < 4.4 ##
+
+The concept of vCPU soft affinity has been introduced for the first time
+in Xen 4.4. In 4.3, it is the domain''s node-affinity that drives the
+NUMA-aware scheduler.
The main difference is soft affinity is per-vCPU, +and so each vCPU can have its own mask of pCPUs, while node-affinity is +per-domain, that is the equivalent of having all the vCPUs with the same +soft affinity. + ## Xen < 4.3 ## As NUMA aware scheduling is a new feature of Xen 4.3, things are a little bit different for earlier version of Xen. If no "cpus=" option is specified and Xen 4.2 is in use, the automatic placement algorithm still runs, but the results is used to _pin_ the vCPUs of the domain to the output node(s). -This is consistent with what was happening with xm/xend, which were also -affecting the domain''s CPU affinity. +This is consistent with what was happening with xm/xend. On a version of Xen earlier than 4.2, there is not automatic placement at -all in xl or libxl, and hence no node or CPU affinity being affected. +all in xl or libxl, and hence no node-affinity, vCPU affinity or pinning +being introduced/modified. ## Limitations ## diff --git a/xen/common/domain.c b/xen/common/domain.c index d8116c7..d6ac4d1 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( if ( !zalloc_cpumask_var(&v->cpu_hard_affinity) || !zalloc_cpumask_var(&v->cpu_hard_affinity_tmp) || !zalloc_cpumask_var(&v->cpu_hard_affinity_saved) || + !zalloc_cpumask_var(&v->cpu_soft_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( free_cpumask_var(v->cpu_hard_affinity); free_cpumask_var(v->cpu_hard_affinity_tmp); free_cpumask_var(v->cpu_hard_affinity_saved); + free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); return NULL; @@ -390,8 +392,6 @@ void domain_update_node_affinity(struct domain *d) node_set(node, d->node_affinity); } - sched_set_node_affinity(d, &d->node_affinity); - spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); @@ -737,6 +737,7 @@ static void complete_domain_destroy(struct rcu_head *head) free_cpumask_var(v->cpu_hard_affinity); free_cpumask_var(v->cpu_hard_affinity_tmp); free_cpumask_var(v->cpu_hard_affinity_saved); + free_cpumask_var(v->cpu_soft_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); } diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index c11f577..42fb418 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -298,6 +298,8 @@ static void dump_domains(unsigned char key) printk("dirty_cpus=%s ", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_hard_affinity); printk("cpu_affinity=%s\n", tmpstr); + cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_soft_affinity); + printk("cpu_soft_affinity=%s\n", tmpstr); printk(" pause_count=%d pause_flags=%lx\n", atomic_read(&v->pause_count), v->pause_flags); arch_dump_vcpu_info(v); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index c6a2560..8b02b7b 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -112,10 +112,24 @@ /* - * Node Balancing + * Hard and soft affinity load balancing. + * + * Idea is each vcpu has some pcpus that it prefers, some that it does not + * prefer but is OK with, and some that it cannot run on at all. The first + * set of pcpus are the ones that are both in the soft affinity *and* in the + * hard affinity; the second set of pcpus are the ones that are in the hard + * affinity but *not* in the soft affinity; the third set of pcpus are the + * ones that are not in the hard affinity. + * + * We implement a two step balancing logic. 
Basically, every time there is + * the need to decide where to run a vcpu, we first check the soft affinity + * (well, actually, the && between soft and hard affinity), to see if we can + * send it where it prefers to (and can) run on. However, if the first step + * does not find any suitable and free pcpu, we fall back checking the hard + * affinity. */ -#define CSCHED_BALANCE_NODE_AFFINITY 0 -#define CSCHED_BALANCE_CPU_AFFINITY 1 +#define CSCHED_BALANCE_SOFT_AFFINITY 0 +#define CSCHED_BALANCE_HARD_AFFINITY 1 /* * Boot parameters @@ -138,7 +152,7 @@ struct csched_pcpu { /* * Convenience macro for accessing the per-PCPU cpumask we need for - * implementing the two steps (vcpu and node affinity) balancing logic. + * implementing the two steps (soft and hard affinity) balancing logic. * It is stored in csched_pcpu so that serialization is not an issue, * as there is a csched_pcpu for each PCPU and we always hold the * runqueue spin-lock when using this. @@ -178,9 +192,6 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; - /* cpumask translated from the domain''s node-affinity. - * Basically, the CPUs we prefer to be scheduled on. */ - cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -261,59 +272,28 @@ __runq_remove(struct csched_vcpu *svc) list_del_init(&svc->runq_elem); } -/* - * Translates node-affinity mask into a cpumask, so that we can use it during - * actual scheduling. That of course will contain all the cpus from all the - * set nodes in the original node-affinity mask. - * - * Note that any serialization needed to access mask safely is complete - * responsibility of the caller of this function/hook. - */ -static void csched_set_node_affinity( - const struct scheduler *ops, - struct domain *d, - nodemask_t *mask) -{ - struct csched_dom *sdom; - int node; - - /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ - if ( unlikely(is_idle_domain(d)) ) - return; - - sdom = CSCHED_DOM(d); - cpumask_clear(sdom->node_affinity_cpumask); - for_each_node_mask( node, *mask ) - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, - &node_to_cpumask(node)); -} #define for_each_csched_balance_step(step) \ - for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ ) + for ( (step) = 0; (step) <= CSCHED_BALANCE_HARD_AFFINITY; (step)++ ) /* - * vcpu-affinity balancing is always necessary and must never be skipped. - * OTOH, if a domain''s node-affinity is said to be automatically computed - * (or if it just spans all the nodes), we can safely avoid dealing with - * node-affinity entirely. + * Hard affinity balancing is always necessary and must never be skipped. + * OTOH, if the vcpu''s soft affinity is full (it spans all the possible + * pcpus) we can safely avoid dealing with it entirely. * - * Node-affinity is also deemed meaningless in case it has empty - * intersection with mask, to cover the cases where using the node-affinity + * A vcpu''s soft affinity is also deemed meaningless in case it has empty + * intersection with mask, to cover the cases where using the soft affinity * mask seems legit, but would instead led to trying to schedule the vcpu * on _no_ pcpu! Typical use cases are for mask to be equal to the vcpu''s - * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus + * hard affinity, or to the && of hard affinity and the set of online cpus * in the domain''s cpupool. 
*/ -static inline int __vcpu_has_node_affinity(const struct vcpu *vc, +static inline int __vcpu_has_soft_affinity(const struct vcpu *vc, const cpumask_t *mask) { - const struct domain *d = vc->domain; - const struct csched_dom *sdom = CSCHED_DOM(d); - - if ( d->auto_node_affinity - || cpumask_full(sdom->node_affinity_cpumask) - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) + if ( cpumask_full(vc->cpu_soft_affinity) + || !cpumask_intersects(vc->cpu_soft_affinity, mask) ) return 0; return 1; @@ -321,23 +301,22 @@ static inline int __vcpu_has_node_affinity(const struct vcpu *vc, /* * Each csched-balance step uses its own cpumask. This function determines - * which one (given the step) and copies it in mask. For the node-affinity - * balancing step, the pcpus that are not part of vc''s vcpu-affinity are + * which one (given the step) and copies it in mask. For the soft affinity + * balancing step, the pcpus that are not part of vc''s hard affinity are * filtered out from the result, to avoid running a vcpu where it would * like, but is not allowed to! */ static void csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) { - if ( step == CSCHED_BALANCE_NODE_AFFINITY ) + if ( step == CSCHED_BALANCE_SOFT_AFFINITY ) { - cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_hard_affinity); + cpumask_and(mask, vc->cpu_soft_affinity, vc->cpu_hard_affinity); if ( unlikely(cpumask_empty(mask)) ) cpumask_copy(mask, vc->cpu_hard_affinity); } - else /* step == CSCHED_BALANCE_CPU_AFFINITY */ + else /* step == CSCHED_BALANCE_HARD_AFFINITY */ cpumask_copy(mask, vc->cpu_hard_affinity); } @@ -398,15 +377,15 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new) else if ( !idlers_empty ) { /* - * Node and vcpu-affinity balancing loop. For vcpus without - * a useful node-affinity, consider vcpu-affinity only. + * Soft and hard affinity balancing loop. For vcpus without + * a useful soft affinity, consider hard affinity only. */ for_each_csched_balance_step( balance_step ) { int new_idlers_empty; - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(new->vcpu, + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(new->vcpu, new->vcpu->cpu_hard_affinity) ) continue; @@ -418,11 +397,11 @@ __runq_tickle(unsigned int cpu, struct csched_vcpu *new) /* * Let''s not be too harsh! If there aren''t idlers suitable - * for new in its node-affinity mask, make sure we check its - * vcpu-affinity as well, before taking final decisions. + * for new in its soft affinity mask, make sure we check its + * hard affinity as well, before taking final decisions. */ if ( new_idlers_empty - && balance_step == CSCHED_BALANCE_NODE_AFFINITY ) + && balance_step == CSCHED_BALANCE_SOFT_AFFINITY ) continue; /* @@ -649,23 +628,23 @@ _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit) /* * We want to pick up a pcpu among the ones that are online and * can accommodate vc, which is basically what we computed above - * and stored in cpus. As far as vcpu-affinity is concerned, + * and stored in cpus. As far as hard affinity is concerned, * there always will be at least one of these pcpus, hence cpus * is never empty and the calls to cpumask_cycle() and * cpumask_test_cpu() below are ok. 
* - * On the other hand, when considering node-affinity too, it + * On the other hand, when considering soft affinity too, it * is possible for the mask to become empty (for instance, if the * domain has been put in a cpupool that does not contain any of the - * nodes in its node-affinity), which would result in the ASSERT()-s + * pcpus in its soft affinity), which would result in the ASSERT()-s * inside cpumask_*() operations triggering (in debug builds). * - * Therefore, in this case, we filter the node-affinity mask against - * cpus and, if the result is empty, we just skip the node-affinity + * Therefore, in this case, we filter the soft affinity mask against + * cpus and, if the result is empty, we just skip the soft affinity * balancing step all together. */ - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(vc, &cpus) ) + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, &cpus) ) continue; /* Pick an online CPU from the proper affinity mask */ @@ -1122,13 +1101,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) - { - xfree(sdom); - return NULL; - } - cpumask_setall(sdom->node_affinity_cpumask); - /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); INIT_LIST_HEAD(&sdom->active_sdom_elem); @@ -1158,9 +1130,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom) static void csched_free_domdata(const struct scheduler *ops, void *data) { - struct csched_dom *sdom = data; - - free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1486,19 +1455,19 @@ csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) BUG_ON( is_idle_vcpu(vc) ); /* - * If the vcpu has no useful node-affinity, skip this vcpu. - * In fact, what we want is to check if we have any node-affine - * work to steal, before starting to look at vcpu-affine work. + * If the vcpu has no useful soft affinity, skip this vcpu. + * In fact, what we want is to check if we have any "soft-affine + * work" to steal, before starting to look at "hard-affine work". * * Notice that, if not even one vCPU on this runq has a useful - * node-affinity, we could have avoid considering this runq for - * a node balancing step in the first place. This, for instance, + * soft affinity, we could have avoid considering this runq for + * a soft balancing step in the first place. This, for instance, * can be implemented by taking note of on what runq there are - * vCPUs with useful node-affinities in some sort of bitmap + * vCPUs with useful soft affinities in some sort of bitmap * or counter. */ - if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY - && !__vcpu_has_node_affinity(vc, vc->cpu_hard_affinity) ) + if ( balance_step == CSCHED_BALANCE_SOFT_AFFINITY + && !__vcpu_has_soft_affinity(vc, vc->cpu_hard_affinity) ) continue; csched_balance_cpumask(vc, balance_step, csched_balance_mask); @@ -1546,17 +1515,17 @@ csched_load_balance(struct csched_private *prv, int cpu, SCHED_STAT_CRANK(load_balance_other); /* - * Let''s look around for work to steal, taking both vcpu-affinity - * and node-affinity into account. More specifically, we check all + * Let''s look around for work to steal, taking both hard affinity + * and soft affinity into account. More specifically, we check all * the non-idle CPUs'' runq, looking for: - * 1. any node-affine work to steal first, - * 2. if not finding anything, any vcpu-affine work to steal. 
+ * 1. any "soft-affine work" to steal first, + * 2. if not finding anything, any "hard-affine work" to steal. */ for_each_csched_balance_step( bstep ) { /* * We peek at the non-idling CPUs in a node-wise fashion. In fact, - * it is more likely that we find some node-affine work on our same + * it is more likely that we find some affine work on our same * node, not to mention that migrating vcpus within the same node * could well expected to be cheaper than across-nodes (memory * stays local, there might be some node-wide cache[s], etc.). @@ -1982,8 +1951,6 @@ const struct scheduler sched_credit_def = { .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, - .set_node_affinity = csched_set_node_affinity, - .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c index c4236c5..c9ae521 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -198,6 +198,8 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) else cpumask_setall(v->cpu_hard_affinity); + cpumask_setall(v->cpu_soft_affinity); + /* Initialise the per-vcpu timers. */ init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, v->processor); @@ -286,6 +288,7 @@ int sched_move_domain(struct domain *d, struct cpupool *c) migrate_timer(&v->poll_timer, new_p); cpumask_setall(v->cpu_hard_affinity); + cpumask_setall(v->cpu_soft_affinity); lock = vcpu_schedule_lock_irq(v); v->processor = new_p; diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 40e5927..3575312 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -198,6 +198,9 @@ struct vcpu /* Used to restore affinity across S3. */ cpumask_var_t cpu_hard_affinity_saved; + /* Bitmask of CPUs on which this VCPU prefers to run. */ + cpumask_var_t cpu_soft_affinity; + /* Bitmask of CPUs which are holding onto this VCPU''s state. */ cpumask_var_t vcpu_dirty_cpumask;
Dario Faggioli
2013-Dec-02 18:28 UTC
[PATCH v5 11/17] xen: derive NUMA node affinity from hard and soft CPU affinity
If a domain''s NUMA node-affinity (which is what controls memory
allocations) is provided by the user/toolstack, it is just left
alone. However, if the user does not say anything, leaving it all
to Xen, let''s compute it in the following way:

    1. cpupool''s cpus & hard-affinity & soft-affinity
    2. if (1) is empty: cpupool''s cpus & hard-affinity

This guarantees that memory is allocated from the narrowest possible
set of NUMA nodes, and makes it relatively easy to set up NUMA-aware
scheduling on top of soft affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v3:
 * avoid pointless calls to cpumask_clear(), as requested during review;
 * ASSERT() non-emptiness of cpupool & hard affinity, as suggested
   during review.

Changes from v2:
 * the loop computing the mask is now only executed when it really is
   useful, as suggested during review;
 * the loop, and all the cpumask handling is optimized, in a way similar
   to what was suggested during review.
---
 xen/common/domain.c |   61 +++++++++++++++++++++++++++++++++------------------
 1 file changed, 39 insertions(+), 22 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index d6ac4d1..4adc123 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -353,17 +353,17 @@ struct domain *domain_create(
 
 void domain_update_node_affinity(struct domain *d)
 {
-    cpumask_var_t cpumask;
-    cpumask_var_t online_affinity;
+    cpumask_var_t dom_cpumask, dom_cpumask_soft;
+    cpumask_t *dom_affinity;
     const cpumask_t *online;
     struct vcpu *v;
-    unsigned int node;
+    unsigned int cpu;
 
-    if ( !zalloc_cpumask_var(&cpumask) )
+    if ( !zalloc_cpumask_var(&dom_cpumask) )
         return;
-    if ( !alloc_cpumask_var(&online_affinity) )
+    if ( !zalloc_cpumask_var(&dom_cpumask_soft) )
     {
-        free_cpumask_var(cpumask);
+        free_cpumask_var(dom_cpumask);
         return;
     }
 
@@ -371,31 +371,48 @@ void domain_update_node_affinity(struct domain *d)
 
     spin_lock(&d->node_affinity_lock);
 
-    for_each_vcpu ( d, v )
-    {
-        cpumask_and(online_affinity, v->cpu_hard_affinity, online);
-        cpumask_or(cpumask, cpumask, online_affinity);
-    }
-
     /*
-     * If d->auto_node_affinity is true, the domain''s node-affinity mask
-     * (d->node_affinity) is automaically computed from all the domain''s
-     * vcpus'' vcpu-affinity masks (the union of which we have just built
-     * above in cpumask). OTOH, if d->auto_node_affinity is false, we
-     * must leave the node-affinity of the domain alone.
+     * If d->auto_node_affinity is true, let''s compute the domain''s
+     * node-affinity and update d->node_affinity accordingly. If false,
+     * just leave d->node_affinity alone.
      */
     if ( d->auto_node_affinity )
     {
+        /*
+         * We want the narrowest possible set of pcpus (to get the narrowest
+         * possible set of nodes). What we need is the cpumask of where the
+         * domain can run (the union of the hard affinity of all its vcpus),
+         * and the full mask of where it would prefer to run (the union of
+         * the soft affinity of all its various vcpus). Let''s build them.
+ */ + for_each_vcpu ( d, v ) + { + cpumask_or(dom_cpumask, dom_cpumask, v->cpu_hard_affinity); + cpumask_or(dom_cpumask_soft, dom_cpumask_soft, + v->cpu_soft_affinity); + } + /* Filter out non-online cpus */ + cpumask_and(dom_cpumask, dom_cpumask, online); + ASSERT(!cpumask_empty(dom_cpumask)); + /* And compute the intersection between hard, online and soft */ + cpumask_and(dom_cpumask_soft, dom_cpumask_soft, dom_cpumask); + + /* + * If not empty, the intersection of hard, soft and online is the + * narrowest set we want. If empty, we fall back to hard&online. + */ + dom_affinity = cpumask_empty(dom_cpumask_soft) ? + dom_cpumask : dom_cpumask_soft; + nodes_clear(d->node_affinity); - for_each_online_node ( node ) - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, d->node_affinity); + for_each_cpu( cpu, dom_affinity ) + node_set(cpu_to_node(cpu), d->node_affinity); } spin_unlock(&d->node_affinity_lock); - free_cpumask_var(online_affinity); - free_cpumask_var(cpumask); + free_cpumask_var(dom_cpumask_soft); + free_cpumask_var(dom_cpumask); }
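The derivation rule above (hard & soft & online if non-empty, otherwise hard & online) can be sketched in isolation as follows; the uint64_t bitmaps and the 4-pcpus-per-node cpu_to_node() are assumptions made purely for the example:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static int cpu_to_node(int cpu) { return cpu / 4; } /* toy topology */

    static uint64_t derive_node_affinity(uint64_t hard_union,
                                         uint64_t soft_union,
                                         uint64_t online)
    {
        uint64_t cpus = hard_union & online; /* where the domain can run */
        uint64_t pref = soft_union & cpus;   /* ... and prefers to run */
        uint64_t pick = pref ? pref : cpus;  /* fall back to hard & online */
        uint64_t nodes = 0;
        int cpu;

        for ( cpu = 0; cpu < 64; cpu++ )
            if ( pick & (1ULL << cpu) )
                nodes |= 1ULL << cpu_to_node(cpu);
        return nodes;
    }

    int main(void)
    {
        /* hard: pcpus 0-7, soft: pcpus 4-5, everything online */
        printf("node-affinity = %#" PRIx64 "\n",
               derive_node_affinity(0xff, 0x30, 0xff)); /* 0x2: node 1 only */
        return 0;
    }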
Dario Faggioli
2013-Dec-02 18:29 UTC
[PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
by adding a flag for the caller to specify which one they care about.

At the same time, enable the caller to get back the "effective affinity"
of the vCPU. That is the intersection between the cpupool''s cpus, the
(new) hard affinity and, for soft affinity, the (new) soft affinity. In
fact, despite what has been successfully set with the
DOMCTL_setvcpuaffinity hypercall, the Xen scheduler will never run a
vCPU outside of its hard affinity or of its domain''s cpupool.

This happens by adding another cpumap to the interface and making both
the cpumaps IN/OUT parameters for DOMCTL_setvcpuaffinity (they''re of
course OUT-only for DOMCTL_getvcpuaffinity).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes since v4:
 * make both the cpumaps IN/OUT and use them for reporting back both
   effective hard and soft affinity, as requested during review;
 * fix the arguments'' names, the comments and the annotations in the
   public header accordingly, as requested during review.

Changes since v3:
 * no longer discarding possible errors. Also, roll back setting hard
   affinity if setting soft affinity fails afterwards, so that the
   caller really sees no changes when the call fails, as requested
   during review;
 * fixed -EFAULT --> -ENOMEM in case of a failed memory allocation,
   as requested during review;
 * removed unnecessary use of pointer to pointer, as requested during
   review.

Changes from v2:
 * in DOMCTL_[sg]etvcpuaffinity, flag is really a flag now, i.e., we
   accept requests for setting and getting: (1) only hard affinity;
   (2) only soft affinity; (3) both; as suggested during review.
---
 tools/libxc/xc_domain.c     |   12 +++--
 xen/arch/x86/traps.c        |    4 +-
 xen/common/domctl.c         |  111 +++++++++++++++++++++++++++++++++++++++----
 xen/common/schedule.c       |   35 +++++++++-----
 xen/common/wait.c           |    6 +-
 xen/include/public/domctl.h |   17 ++++++-
 xen/include/xen/sched.h     |    3 +
 7 files changed, 154 insertions(+), 34 deletions(-)

diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
index 1ccafc5..8c807a6 100644
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -215,13 +215,14 @@ int xc_vcpu_setaffinity(xc_interface *xch,
 
     domctl.cmd = XEN_DOMCTL_setvcpuaffinity;
     domctl.domain = (domid_t)domid;
-    domctl.u.vcpuaffinity.vcpu = vcpu;
+    domctl.u.vcpuaffinity.vcpu  = vcpu;
+    domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD;
 
     memcpy(local, cpumap, cpusize);
 
-    set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
+    set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap_hard.nr_bits = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -259,9 +260,10 @@ int xc_vcpu_getaffinity(xc_interface *xch,
     domctl.cmd = XEN_DOMCTL_getvcpuaffinity;
     domctl.domain = (domid_t)domid;
     domctl.u.vcpuaffinity.vcpu = vcpu;
+    domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD;
 
-    set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
-    domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8;
+    set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, local);
+    domctl.u.vcpuaffinity.cpumap_hard.nr_bits = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 157031e..ff4523b 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -3103,7 +3103,7 @@ static void nmi_mce_softirq(void)
          * Make sure to wakeup the vcpu on the
          * specified processor.
          */
-        vcpu_set_affinity(st->vcpu, cpumask_of(st->processor));
+        vcpu_set_hard_affinity(st->vcpu, cpumask_of(st->processor));
 
         /* Affinity is restored in the iret hypercall. */
     }
@@ -3132,7 +3132,7 @@ void async_exception_cleanup(struct vcpu *curr)
     if ( !cpumask_empty(curr->cpu_hard_affinity_tmp) &&
          !cpumask_equal(curr->cpu_hard_affinity_tmp, curr->cpu_hard_affinity) )
     {
-        vcpu_set_affinity(curr, curr->cpu_hard_affinity_tmp);
+        vcpu_set_hard_affinity(curr, curr->cpu_hard_affinity_tmp);
         cpumask_clear(curr->cpu_hard_affinity_tmp);
     }
 
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
index 5e0ac5c..9eecb5e 100644
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -287,6 +287,16 @@ void domctl_lock_release(void)
     spin_unlock(&current->domain->hypercall_deadlock_mutex);
 }
 
+static inline
+int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
+{
+    return vcpuaff->flags == 0 ||
+           (vcpuaff->flags & XEN_VCPUAFFINITY_HARD &&
+            guest_handle_is_null(vcpuaff->cpumap_hard.bitmap)) ||
+           (vcpuaff->flags & XEN_VCPUAFFINITY_SOFT &&
+            guest_handle_is_null(vcpuaff->cpumap_soft.bitmap));
+}
+
 long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
 {
     long ret = 0;
@@ -605,31 +615,112 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
     case XEN_DOMCTL_getvcpuaffinity:
     {
         struct vcpu *v;
+        xen_domctl_vcpuaffinity_t *vcpuaff = &op->u.vcpuaffinity;
 
         ret = -EINVAL;
-        if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus )
+        if ( vcpuaff->vcpu >= d->max_vcpus )
             break;
 
         ret = -ESRCH;
-        if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL )
+        if ( (v = d->vcpu[vcpuaff->vcpu]) == NULL )
             break;
 
         if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
         {
-            cpumask_var_t new_affinity;
+            cpumask_var_t new_affinity, old_affinity;
+            cpumask_t *online = cpupool_online_cpumask(v->domain->cpupool);
+
+            /*
+             * We want to be able to restore hard affinity if we are trying
+             * to set both and changing soft affinity (which happens later,
+             * when hard affinity has been successfully changed already)
+             * fails.
+             */
+            if ( !alloc_cpumask_var(&old_affinity) )
+            {
+                ret = -ENOMEM;
+                break;
+            }
+            cpumask_copy(old_affinity, v->cpu_hard_affinity);
 
-            ret = xenctl_bitmap_to_cpumask(
-                &new_affinity, &op->u.vcpuaffinity.cpumap);
-            if ( !ret )
+            if ( !alloc_cpumask_var(&new_affinity) )
             {
-                ret = vcpu_set_affinity(v, new_affinity);
-                free_cpumask_var(new_affinity);
+                free_cpumask_var(old_affinity);
+                ret = -ENOMEM;
+                break;
             }
+
+            ret = -EINVAL;
+            if ( vcpuaffinity_params_invalid(vcpuaff) )
+                goto setvcpuaffinity_out;
+
+            /*
+             * We both set a new affinity and report back to the caller what
+             * the scheduler will be effectively using.
+             */
+            if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
+            {
+                ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
+                                              &vcpuaff->cpumap_hard,
+                                              vcpuaff->cpumap_hard.nr_bits);
+                if ( !ret )
+                    ret = vcpu_set_hard_affinity(v, new_affinity);
+                if ( ret )
+                    goto setvcpuaffinity_out;
+
+                /*
+                 * For hard affinity, what we return is the intersection of
+                 * cpupool''s online mask and the new hard affinity.
+ */ + cpumask_and(new_affinity, online, v->cpu_hard_affinity); + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard, + new_affinity); + } + if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT ) + { + ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), + &vcpuaff->cpumap_soft, + vcpuaff->cpumap_soft.nr_bits); + if ( !ret) + ret = vcpu_set_soft_affinity(v, new_affinity); + if ( ret ) + { + /* + * Since we''re returning error, the caller expects nothing + * happened, so we rollback the changes to hard affinity + * (if any). + */ + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) + vcpu_set_hard_affinity(v, old_affinity); + goto setvcpuaffinity_out; + } + + /* + * For soft affinity, we return the intersection between the + * new soft affinity, the cpupool''s online map and the (new) + * hard affinity. + */ + cpumask_and(new_affinity, new_affinity, online); + cpumask_and(new_affinity, new_affinity, v->cpu_hard_affinity); + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft, + new_affinity); + } + + setvcpuaffinity_out: + free_cpumask_var(new_affinity); + free_cpumask_var(old_affinity); } else { - ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity); + ret = -EINVAL; + if (vcpuaffinity_params_invalid(vcpuaff)) + break; + + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard, + v->cpu_hard_affinity); + if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT ) + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft, + v->cpu_soft_affinity); } } break; diff --git a/xen/common/schedule.c b/xen/common/schedule.c index c9ae521..b1e9b08 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -654,22 +654,14 @@ void sched_set_node_affinity(struct domain *d, nodemask_t *mask) SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); } -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) +static int vcpu_set_affinity( + struct vcpu *v, const cpumask_t *affinity, cpumask_t *which) { - cpumask_t online_affinity; - cpumask_t *online; spinlock_t *lock; - if ( v->domain->is_pinned ) - return -EINVAL; - online = VCPU2ONLINE(v); - cpumask_and(&online_affinity, affinity, online); - if ( cpumask_empty(&online_affinity) ) - return -EINVAL; - lock = vcpu_schedule_lock_irq(v); - cpumask_copy(v->cpu_hard_affinity, affinity); + cpumask_copy(which, affinity); /* Always ask the scheduler to re-evaluate placement * when changing the affinity */ @@ -688,6 +680,27 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) return 0; } +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity) +{ + cpumask_t online_affinity; + cpumask_t *online; + + if ( v->domain->is_pinned ) + return -EINVAL; + + online = VCPU2ONLINE(v); + cpumask_and(&online_affinity, affinity, online); + if ( cpumask_empty(&online_affinity) ) + return -EINVAL; + + return vcpu_set_affinity(v, affinity, v->cpu_hard_affinity); +} + +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity) +{ + return vcpu_set_affinity(v, affinity, v->cpu_soft_affinity); +} + /* Block the currently-executing domain until a pertinent event occurs. */ void vcpu_block(void) { diff --git a/xen/common/wait.c b/xen/common/wait.c index 3f6ff41..1f6b597 100644 --- a/xen/common/wait.c +++ b/xen/common/wait.c @@ -135,7 +135,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) /* Save current VCPU affinity; force wakeup on *this* CPU only. 
*/ wqv->wakeup_cpu = smp_processor_id(); cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); domain_crash_synchronous(); @@ -166,7 +166,7 @@ static void __prepare_to_wait(struct waitqueue_vcpu *wqv) static void __finish_wait(struct waitqueue_vcpu *wqv) { wqv->esp = NULL; - (void)vcpu_set_affinity(current, &wqv->saved_affinity); + (void)vcpu_set_hard_affinity(current, &wqv->saved_affinity); } void check_wakeup_from_wait(void) @@ -184,7 +184,7 @@ void check_wakeup_from_wait(void) /* Re-set VCPU affinity and re-enter the scheduler. */ struct vcpu *curr = current; cpumask_copy(&wqv->saved_affinity, curr->cpu_hard_affinity); - if ( vcpu_set_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) + if ( vcpu_set_hard_affinity(curr, cpumask_of(wqv->wakeup_cpu)) ) { gdprintk(XENLOG_ERR, "Unable to set vcpu affinity\n"); domain_crash_synchronous(); diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index 01a3652..d44a775 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -300,8 +300,21 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ struct xen_domctl_vcpuaffinity { - uint32_t vcpu; /* IN */ - struct xenctl_bitmap cpumap; /* IN/OUT */ + /* IN variables. */ + uint32_t vcpu; + /* Set/get the hard affinity for vcpu */ +#define _XEN_VCPUAFFINITY_HARD 0 +#define XEN_VCPUAFFINITY_HARD (1U<<_XEN_VCPUAFFINITY_HARD) + /* Set/get the soft affinity for vcpu */ +#define _XEN_VCPUAFFINITY_SOFT 1 +#define XEN_VCPUAFFINITY_SOFT (1U<<_XEN_VCPUAFFINITY_SOFT) + uint32_t flags; + /* + * IN/OUT variables. Both are IN/OUT for XEN_DOMCTL_setvcpuaffinity + * and OUT-only for XEN_DOMCTL_getvcpuaffinity. + */ + struct xenctl_bitmap cpumap_hard; + struct xenctl_bitmap cpumap_soft; }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 3575312..0f728b3 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -755,7 +755,8 @@ void scheduler_free(struct scheduler *sched); int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); void vcpu_force_reschedule(struct vcpu *v); int cpu_disable_scheduler(unsigned int cpu); -int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_hard_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_soft_affinity(struct vcpu *v, const cpumask_t *affinity); void restore_vcpu_affinity(struct domain *d); void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
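For clarity, the "effective affinity" this hypercall now reports back can be modelled as plain mask intersections; the sketch below uses uint64_t bitmaps instead of the real cpumask/xenctl_bitmap machinery:

    #include <stdint.h>

    struct effective { uint64_t hard, soft; };

    /* What DOMCTL_setvcpuaffinity writes back into cpumap_hard/cpumap_soft:
     * the masks the scheduler will actually honour. */
    static struct effective
    effective_affinity(uint64_t pool_online, uint64_t new_hard,
                       uint64_t new_soft)
    {
        struct effective e;

        e.hard = new_hard & pool_online;  /* hard & cpupool online cpus */
        e.soft = new_soft & e.hard;       /* soft & hard & cpupool online */
        return e;
    }

If e.soft comes back empty, the soft affinity can never influence scheduling, which is exactly the kind of situation a toolstack may want to warn the user about.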
Dario Faggioli
2013-Dec-02 18:29 UTC
[PATCH v5 13/17] libxc: get and set soft and hard affinity
by using the flag and the new cpumap arguments introduced in the parameters of the DOMCTL_{get,set}_vcpuaffinity hypercalls. Now, both xc_vcpu_setaffinity() and xc_vcpu_getaffinity() have a new flag parameter, to specify whether the user wants to set/get hard affinity, soft affinity or both. They also have two cpumap parameters instead of only one. This way, it is possible to set/get both hard and soft affinity at the same time (and, in case of set, each one to its own value). In xc_vcpu_setaffinity(), the cpumaps are IN/OUT parameters, as it is for the corresponding arguments of the DOMCTL_set_vcpuaffinity hypercall. What Xen puts there is the hard and soft effective affinity, that is what Xen will actually use for scheduling. In-tree callers are also fixed to cope with the new interface. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Ian Campbell <ian.campbell@citrix.com> --- Changes from v4: * update toward the new hypercall interface; * migrate to hypercall BOUNCEs instead of BUFFERs, as suggested during (v3) review; Changes from v2: * better cleanup logic in _vcpu_setaffinity() (regarding xc_hypercall_buffer_{alloc,free}()), as suggested during review; * make it more evident that DOMCTL_setvcpuaffinity has an out parameter, by calling ecpumap_out, and improving the comment wrt that; * change the interface and have xc_vcpu_[sg]etaffinity() so that they take the new parameters (flags and ecpumap_out) and fix the in tree callers. --- tools/libxc/xc_domain.c | 72 ++++++++++++++++++++++------------- tools/libxc/xenctrl.h | 55 ++++++++++++++++++++++++++- tools/libxl/libxl.c | 6 ++- tools/ocaml/libs/xc/xenctrl_stubs.c | 8 +++- tools/python/xen/lowlevel/xc/xc.c | 6 ++- 5 files changed, 113 insertions(+), 34 deletions(-) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 8c807a6..0348f23 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -192,43 +192,53 @@ int xc_domain_node_getaffinity(xc_interface *xch, int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, - xc_cpumap_t cpumap) + xc_cpumap_t cpumap_hard_inout, + xc_cpumap_t cpumap_soft_inout, + uint32_t flags) { DECLARE_DOMCTL; - DECLARE_HYPERCALL_BUFFER(uint8_t, local); + DECLARE_HYPERCALL_BOUNCE(cpumap_hard_inout, 0, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); + DECLARE_HYPERCALL_BOUNCE(cpumap_soft_inout, 0, + XC_HYPERCALL_BUFFER_BOUNCE_BOTH); int ret = -1; int cpusize; cpusize = xc_get_cpumap_size(xch); - if (!cpusize) + if ( !cpusize ) { PERROR("Could not get number of cpus"); - goto out; + return -1; } - local = xc_hypercall_buffer_alloc(xch, local, cpusize); - if ( local == NULL ) + HYPERCALL_BOUNCE_SET_SIZE(cpumap_hard_inout, cpusize); + HYPERCALL_BOUNCE_SET_SIZE(cpumap_soft_inout, cpusize); + + if ( xc_hypercall_bounce_pre(xch, cpumap_hard_inout) || + xc_hypercall_bounce_pre(xch, cpumap_soft_inout) ) { - PERROR("Could not allocate memory for setvcpuaffinity domctl hypercall"); + PERROR("Could not allocate hcall buffers for DOMCTL_setvcpuaffinity"); goto out; } domctl.cmd = XEN_DOMCTL_setvcpuaffinity; domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD; - - memcpy(local, cpumap, cpusize); - - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, local); + domctl.u.vcpuaffinity.flags = flags; + set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, + cpumap_hard_inout); domctl.u.vcpuaffinity.cpumap_hard.nr_bits = cpusize * 8; + 
set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_soft.bitmap, + cpumap_soft_inout); + domctl.u.vcpuaffinity.cpumap_soft.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); - xc_hypercall_buffer_free(xch, local); - out: + xc_hypercall_bounce_post(xch, cpumap_hard_inout); + xc_hypercall_bounce_post(xch, cpumap_soft_inout); + return ret; } @@ -236,41 +246,51 @@ int xc_vcpu_setaffinity(xc_interface *xch, int xc_vcpu_getaffinity(xc_interface *xch, uint32_t domid, int vcpu, - xc_cpumap_t cpumap) + xc_cpumap_t cpumap_hard, + xc_cpumap_t cpumap_soft, + uint32_t flags) { DECLARE_DOMCTL; - DECLARE_HYPERCALL_BUFFER(uint8_t, local); + DECLARE_HYPERCALL_BOUNCE(cpumap_hard, 0, XC_HYPERCALL_BUFFER_BOUNCE_OUT); + DECLARE_HYPERCALL_BOUNCE(cpumap_soft, 0, XC_HYPERCALL_BUFFER_BOUNCE_OUT); int ret = -1; int cpusize; cpusize = xc_get_cpumap_size(xch); - if (!cpusize) + if ( !cpusize ) { PERROR("Could not get number of cpus"); - goto out; + return -1; } - local = xc_hypercall_buffer_alloc(xch, local, cpusize); - if (local == NULL) + HYPERCALL_BOUNCE_SET_SIZE(cpumap_hard, cpusize); + HYPERCALL_BOUNCE_SET_SIZE(cpumap_soft, cpusize); + + if ( xc_hypercall_bounce_pre(xch, cpumap_hard) || + xc_hypercall_bounce_pre(xch, cpumap_soft) ) { - PERROR("Could not allocate memory for getvcpuaffinity domctl hypercall"); + PERROR("Could not allocate hcall buffers for DOMCTL_getvcpuaffinity"); goto out; } domctl.cmd = XEN_DOMCTL_getvcpuaffinity; domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - domctl.u.vcpuaffinity.flags = XEN_VCPUAFFINITY_HARD; + domctl.u.vcpuaffinity.flags = flags; - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, local); + set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_hard.bitmap, + cpumap_hard); domctl.u.vcpuaffinity.cpumap_hard.nr_bits = cpusize * 8; + set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap_soft.bitmap, + cpumap_soft); + domctl.u.vcpuaffinity.cpumap_soft.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); - memcpy(cpumap, local, cpusize); + out: + xc_hypercall_bounce_post(xch, cpumap_hard); + xc_hypercall_bounce_post(xch, cpumap_soft); - xc_hypercall_buffer_free(xch, local); -out: return ret; } diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 733ed03..7663e2f 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -582,14 +582,65 @@ int xc_domain_node_getaffinity(xc_interface *xch, uint32_t domind, xc_nodemap_t nodemap); +/** + * This function specifies the CPU affinity for a vcpu. + * + * There are two kinds of affinity. Soft affinity is on what CPUs a vcpu + * prefers to run. Hard affinity is on what CPUs a vcpu is allowed to run. + * If flags contains XEN_VCPUAFFINITY_SOFT, the soft affinity it is set to + * what cpumap_soft_inout contains. If flags contains XEN_VCPUAFFINITY_HARD, + * the hard affinity is set to what cpumap_hard_inout contains. Both flags + * can be set at the same time, in which case both soft and hard affinity are + * set to what the respective parameter contains. + * + * The function also returns the effective hard or/and soft affinity, still + * via the cpumap_soft_inout and cpumap_hard_inout parameters. Effective + * affinity is, in case of soft affinity, the intersection of soft affinity, + * hard affinity and the cpupool''s online CPUs for the domain, and is returned + * in cpumap_soft_inout, if XEN_VCPUAFFINITY_SOFT is set in flags. 
In case of + * hard affinity, it is the intersection between hard affinity and the + * cpupool''s online CPUs, and is returned in cpumap_hard_inout, if + * XEN_VCPUAFFINITY_HARD is set in flags. If both flags are set, both soft + * and hard affinity are returned in the respective parameter. + * + * We do report it back as effective affinity is what the Xen scheduler will + * actually use, and we thus allow checking whether or not that matches with, + * or at least is good enough for, the caller''s purposes. + * + * @param xch a handle to an open hypervisor interface. + * @param domid the id of the domain to which the vcpu belongs + * @param vcpu the vcpu id wihin the domain + * @param cpumap_hard_inout specifies(/returns) the (effective) hard affinity + * @param cpumap_soft_inout specifies(/returns) the (effective) soft affinity + * @param flags what we want to set + */ int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, - xc_cpumap_t cpumap); + xc_cpumap_t cpumap_hard_inout, + xc_cpumap_t cpumap_soft_inout, + uint32_t flags); + +/** + * This function retrieves hard and soft CPU affinity of a vcpu, + * depending on what flags are set. + * + * Soft affinity is returned in cpumap_soft if XEN_VCPUAFFINITY_SOFT is set. + * Hard affinity is returned in cpumap_hard if XEN_VCPUAFFINITY_HARD is set. + * + * @param xch a handle to an open hypervisor interface. + * @param domid the id of the domain to which the vcpu belongs + * @param vcpu the vcpu id wihin the domain + * @param cpumap_hard is where hard affinity is returned + * @param cpumap_soft is where soft affinity is returned + * @param flags what we want get + */ int xc_vcpu_getaffinity(xc_interface *xch, uint32_t domid, int vcpu, - xc_cpumap_t cpumap); + xc_cpumap_t cpumap_hard, + xc_cpumap_t cpumap_soft, + uint32_t flags); /** diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index d3180dc..e55bd68 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4546,7 +4546,8 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LOGE(ERROR, "getting vcpu info"); goto err; } - if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, ptr->cpumap.map) == -1) { + if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, ptr->cpumap.map, + NULL, XEN_VCPUAFFINITY_HARD) == -1) { LOGE(ERROR, "getting vcpu affinity"); goto err; } @@ -4570,7 +4571,8 @@ err: int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap) { - if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map)) { + if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map, NULL, + XEN_VCPUAFFINITY_HARD)) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu affinity"); return ERROR_FAIL; } diff --git a/tools/ocaml/libs/xc/xenctrl_stubs.c b/tools/ocaml/libs/xc/xenctrl_stubs.c index f5cf0ed..4d22b82 100644 --- a/tools/ocaml/libs/xc/xenctrl_stubs.c +++ b/tools/ocaml/libs/xc/xenctrl_stubs.c @@ -438,7 +438,9 @@ CAMLprim value stub_xc_vcpu_setaffinity(value xch, value domid, c_cpumap[i/8] |= 1 << (i&7); } retval = xc_vcpu_setaffinity(_H(xch), _D(domid), - Int_val(vcpu), c_cpumap); + Int_val(vcpu), + c_cpumap, NULL, + XEN_VCPUAFFINITY_HARD); free(c_cpumap); if (retval < 0) @@ -460,7 +462,9 @@ CAMLprim value stub_xc_vcpu_getaffinity(value xch, value domid, failwith_xc(_H(xch)); retval = xc_vcpu_getaffinity(_H(xch), _D(domid), - Int_val(vcpu), c_cpumap); + Int_val(vcpu), + c_cpumap, NULL, + XEN_VCPUAFFINITY_HARD); if (retval < 0) { free(c_cpumap); failwith_xc(_H(xch)); diff --git a/tools/python/xen/lowlevel/xc/xc.c 
b/tools/python/xen/lowlevel/xc/xc.c index 737bdac..ae2522b 100644 --- a/tools/python/xen/lowlevel/xc/xc.c +++ b/tools/python/xen/lowlevel/xc/xc.c @@ -256,7 +256,8 @@ static PyObject *pyxc_vcpu_setaffinity(XcObject *self, } } - if ( xc_vcpu_setaffinity(self->xc_handle, dom, vcpu, cpumap) != 0 ) + if ( xc_vcpu_setaffinity(self->xc_handle, dom, vcpu, cpumap, + NULL, XEN_VCPUAFFINITY_HARD) != 0 ) { free(cpumap); return pyxc_error_to_exception(self->xc_handle); @@ -403,7 +404,8 @@ static PyObject *pyxc_vcpu_getinfo(XcObject *self, if(cpumap == NULL) return pyxc_error_to_exception(self->xc_handle); - rc = xc_vcpu_getaffinity(self->xc_handle, dom, vcpu, cpumap); + rc = xc_vcpu_getaffinity(self->xc_handle, dom, vcpu, cpumap, + NULL, XEN_VCPUAFFINITY_HARD); if ( rc < 0 ) { free(cpumap);
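To see the reworked libxc interface in action, here is a minimal sketch
(not part of the series; the helper name and the literal masks are
invented for illustration) of a caller that sets both kinds of affinity
for vcpu 0 with a single hypercall, using the existing xc_cpumap_alloc()
helper:

    #include <xenctrl.h>

    /* Illustrative only: hard affinity = pcpus 0-3, soft = pcpus 0-1.
     * On success, Xen overwrites both maps with the *effective*
     * affinities, as described in the xenctrl.h comment above. */
    static int set_both_affinities(xc_interface *xch, uint32_t domid)
    {
        xc_cpumap_t hard = xc_cpumap_alloc(xch);
        xc_cpumap_t soft = xc_cpumap_alloc(xch);
        int rc = -1;

        if ( !hard || !soft )
            goto out;

        hard[0] = 0x0f;  /* pcpus 0-3 */
        soft[0] = 0x03;  /* pcpus 0-1 */

        rc = xc_vcpu_setaffinity(xch, domid, 0, hard, soft,
                                 XEN_VCPUAFFINITY_HARD |
                                 XEN_VCPUAFFINITY_SOFT);
     out:
        free(hard);
        free(soft);
        return rc;
    }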
Make space for two new cpumaps, one in vcpu_info (for getting soft
affinity) and one in build_info (for setting it), and amend the API
for setting vCPU affinity.

libxl_set_vcpuaffinity() now takes two cpumaps, one for hard and one
for soft affinity (LIBXL_API_VERSION is exploited to retain source-level
backward compatibility). Either of the two cpumaps can be NULL, in which
case only the affinity corresponding to the non-NULL cpumap will be
affected.

Getting soft affinity happens indirectly, via `xl vcpu-list''
(as it is already for hard affinity).

This commit also introduces some logic to check whether the affinity
which will be used by Xen to schedule the vCPU(s) does actually match
with the cpumaps provided. In fact, we want to allow every possible
combination of hard and soft affinity to be set, but we warn the
user upon particularly weird situations (e.g., hard and soft being
disjoint sets of pCPUs).

This is the first change breaking the libxl ABI, so it bumps the MAJOR.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
Changes from v4:
 * adapt to the new xc interface;
 * avoid leaking cpumap_soft in libxl_list_vcpu on error, as requested
   during review;
 * fix bogus `return 0'' in libxl_set_vcpuaffinity, as requested during
   review;
 * clarify the comment for LIBXL_HAVE_SOFT_AFFINITY, as suggested during
   review;
 * renamed libxl_bitmap_valid() to libxl_bitmap_bit_valid(), as suggested
   during review.

Changes from v3:
 * only introduce one LIBXL_HAVE_ symbol for soft affinity, as requested
   during review;
 * use LIBXL_API_VERSION instead of having multiple versions of the same
   function, as suggested during review;
 * use libxl_get_nr_cpus() rather than libxl_get_cputopology(), as
   suggested during review;
 * use LOGE() instead of LIBXL__LOG_ERRNO(), as requested during review;
 * kill the flags and use just one _set_vcpuaffinity() function with two
   cpumaps, allowing either of them to be NULL, as suggested during review;
 * avoid overflowing the bitmaps in libxl_bitmap_equal(), as suggested
   during review.

Changes from v2:
 * interface completely redesigned, as discussed during review.
---
 tools/libxl/Makefile        |    2 +
 tools/libxl/libxl.c         |   85 ++++++++++++++++++++++++++++++++++++++-----
 tools/libxl/libxl.h         |   39 +++++++++++++++++++-
 tools/libxl/libxl_create.c  |    6 +++
 tools/libxl/libxl_dom.c     |    3 +-
 tools/libxl/libxl_types.idl |    4 ++
 tools/libxl/libxl_utils.h   |   25 ++++++++++++-
 tools/libxl/xl_cmdimpl.c    |    6 ++-
 8 files changed, 151 insertions(+), 19 deletions(-)

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index d8495bb..1410c44 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -5,7 +5,7 @@ XEN_ROOT = $(CURDIR)/../..
include $(XEN_ROOT)/tools/Rules.mk -MAJOR = 4.3 +MAJOR = 4.4 MINOR = 0 XLUMAJOR = 4.3 diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index e55bd68..870d2a2 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4542,6 +4542,9 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, libxl_bitmap_init(&ptr->cpumap); if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap, 0)) goto err; + libxl_bitmap_init(&ptr->cpumap_soft); + if (libxl_cpu_bitmap_alloc(ctx, &ptr->cpumap_soft, 0)) + goto err; if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LOGE(ERROR, "getting vcpu info"); goto err; @@ -4551,6 +4554,12 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LOGE(ERROR, "getting vcpu affinity"); goto err; } + if (xc_vcpu_getaffinity(ctx->xch, domid, *nb_vcpu, NULL, + ptr->cpumap_soft.map, + XEN_VCPUAFFINITY_SOFT) == -1) { + LOGE(ERROR, "getting vcpu affinity"); + goto err; + } ptr->vcpuid = *nb_vcpu; ptr->cpu = vcpuinfo.cpu; ptr->online = !!vcpuinfo.online; @@ -4563,34 +4572,90 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, err: libxl_bitmap_dispose(&ptr->cpumap); + libxl_bitmap_dispose(&ptr->cpumap_soft); free(ret); GC_FREE; return NULL; } int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, - libxl_bitmap *cpumap) + const libxl_bitmap *cpumap_hard, + const libxl_bitmap *cpumap_soft) { - if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap->map, NULL, - XEN_VCPUAFFINITY_HARD)) { - LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu affinity"); - return ERROR_FAIL; + GC_INIT(ctx); + libxl_bitmap cpumap; + int nr_cpus = 0, rc; + + if (!cpumap_hard && !cpumap_soft) + return ERROR_INVAL; + + nr_cpus = libxl_get_online_cpus(ctx); + if (nr_cpus < 0) + return nr_cpus; + + rc = libxl_cpu_bitmap_alloc(ctx, &cpumap, 0); + if (rc) + return rc; + + if (cpumap_hard) { + libxl_bitmap_copy(ctx, &cpumap, cpumap_hard); + + if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, cpumap.map, NULL, + XEN_VCPUAFFINITY_HARD)) { + LOGE(ERROR, "setting vcpu hard affinity"); + rc = ERROR_FAIL; + goto out; + } + + if (!libxl_bitmap_equal(cpumap_hard, &cpumap, nr_cpus)) + LOG(DEBUG, "New hard affinity for vcpu %d has unreachable cpus", + vcpuid); } - return 0; + if (cpumap_soft) { + libxl_bitmap_copy(ctx, &cpumap, cpumap_soft); + + if (xc_vcpu_setaffinity(ctx->xch, domid, vcpuid, NULL, cpumap.map, + XEN_VCPUAFFINITY_SOFT)) { + LOGE(ERROR, "setting vcpu soft affinity"); + rc = ERROR_FAIL; + goto out; + } + + if (!libxl_bitmap_equal(cpumap_soft, &cpumap, nr_cpus)) + LOG(DEBUG, "New soft affinity for vcpu %d has unreachable cpus", + vcpuid); + /* + * When setting hard affinity, it is guaranteed that the result will + * not be empty, so we need to check for that only for soft affinity. + */ + if (cpumap_soft && libxl_bitmap_is_empty(&cpumap)) + LOG(WARN, "Only unreachable cpus in the soft affinity of vcpu %d." 
+ " Only hard affinity will be considered for scheduling", + vcpuid); + } + rc = 0; + out: + libxl_bitmap_dispose(&cpumap); + GC_FREE; + return rc; } int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, - unsigned int max_vcpus, libxl_bitmap *cpumap) + unsigned int max_vcpus, + const libxl_bitmap *cpumap_hard, + const libxl_bitmap *cpumap_soft) { + GC_INIT(ctx); int i, rc = 0; for (i = 0; i < max_vcpus; i++) { - if (libxl_set_vcpuaffinity(ctx, domid, i, cpumap)) { - LIBXL__LOG(ctx, LIBXL__LOG_WARNING, - "failed to set affinity for %d", i); + if (libxl_set_vcpuaffinity(ctx, domid, i, cpumap_hard, cpumap_soft)) { + LOG(WARN, "failed to set affinity for %d", i); rc = ERROR_FAIL; } } + + GC_FREE; return rc; } diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index a0d0908..d7fefcf 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -82,6 +82,15 @@ #define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1 /* + * LIBXL_HAVE_SOFT_AFFINITY indicates that a ''cpumap_soft'' + * field (of libxl_bitmap type) is present in: + * - libxl_vcpuinfo, containing the soft affinity of a vcpu; + * - libxl_domain_build_info, for setting the soft affinity od + * all vcpus while creating the domain. + */ +#define LIBXL_HAVE_SOFT_AFFINITY 1 + +/* * LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE indicates that the * libxl_vendor_device field is present in the hvm sections of * libxl_domain_build_info. This field tells libxl which @@ -994,9 +1003,35 @@ int libxl_userdata_retrieve(libxl_ctx *ctx, uint32_t domid, int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo); int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, - libxl_bitmap *cpumap); + const libxl_bitmap *cpumap_hard, + const libxl_bitmap *cpumap_soft); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, - unsigned int max_vcpus, libxl_bitmap *cpumap); + unsigned int max_vcpus, + const libxl_bitmap *cpumap_hard, + const libxl_bitmap *cpumap_soft); + +#if defined (LIBXL_API_VERSION) && LIBXL_API_VERSION < 0x040400 + +static inline +int libxl_set_vcpuaffinity_0x040200(libxl_ctx *ctx, uint32_t domid, + uint32_t vcpuid, libxl_bitmap *cpumap) +{ + return libxl_set_vcpuaffinity(ctx, domid, vcpuid, cpumap, NULL); +} + +static inline +int libxl_set_vcpuaffinity_all_0x040200(libxl_ctx *ctx, uint32_t domid, + unsigned int max_vcpus, + libxl_bitmap *cpumap) +{ + return libxl_set_vcpuaffinity_all(ctx, domid, max_vcpus, cpumap, NULL); +} + +#define libxl_set_vcpuaffinity libxl_set_vcpuaffinity_0x040200 +#define libxl_set_vcpuaffinity_all libxl_set_vcpuaffinity_all_0x040200 + +#endif + int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap); int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c index fe7ba0d..58f80df 100644 --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -193,6 +193,12 @@ int libxl__domain_build_info_setdefault(libxl__gc *gc, libxl_bitmap_set_any(&b_info->cpumap); } + if (!b_info->cpumap_soft.size) { + if (libxl_cpu_bitmap_alloc(CTX, &b_info->cpumap_soft, 0)) + return ERROR_FAIL; + libxl_bitmap_set_any(&b_info->cpumap_soft); + } + libxl_defbool_setdefault(&b_info->numa_placement, true); if (!b_info->nodemap.size) { diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 72489f8..4bfed60 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -236,7 +236,8 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, return rc; } 
libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); - libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); + libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap, + &info->cpumap_soft); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); xs_domid = xs_read(ctx->xsh, XBT_NULL, "/tool/xenstored/domid", NULL); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index cba8eff..48699c9 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -298,6 +298,7 @@ libxl_domain_build_info = Struct("domain_build_info",[ ("max_vcpus", integer), ("avail_vcpus", libxl_bitmap), ("cpumap", libxl_bitmap), + ("cpumap_soft", libxl_bitmap), ("nodemap", libxl_bitmap), ("numa_placement", libxl_defbool), ("tsc_mode", libxl_tsc_mode), @@ -510,7 +511,8 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("blocked", bool), ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) - ("cpumap", libxl_bitmap), # current cpu''s affinities + ("cpumap", libxl_bitmap), # current hard cpu affinity + ("cpumap_soft", libxl_bitmap), # current soft cpu affinity ], dir=DIR_OUT) libxl_physinfo = Struct("physinfo", [ diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index e37fb89..4d743c9 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap) { memset(bitmap->map, 0, bitmap->size); } -static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) +static inline int libxl_bitmap_bit_valid(const libxl_bitmap *bitmap, int bit) { return bit >= 0 && bit < (bitmap->size * 8); } @@ -98,6 +98,29 @@ static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) #define libxl_for_each_set_bit(v, m) for (v = 0; v < (m).size * 8; v++) \ if (libxl_bitmap_test(&(m), v)) +static inline int libxl_bitmap_equal(const libxl_bitmap *ba, + const libxl_bitmap *bb, + int nr_bits) +{ + int i; + + /* Only check nr_bits (all bits if <= 0) */ + nr_bits = (nr_bits <= 0) ? 
ba->size * 8 : nr_bits; + for (i = 0; i < nr_bits; i++) { + /* If overflowing one of the bitmaps, we call them different */ + if (!libxl_bitmap_bit_valid(ba, i) || !libxl_bitmap_bit_valid(bb, i)) + return 0; + if (libxl_bitmap_test(ba, i) != libxl_bitmap_test(bb, i)) + return 0; + } + return 1; +} + +static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu) +{ + return libxl_bitmap_bit_valid(cpumap, cpu); +} + int libxl_cpu_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *cpumap, int max_cpus); int libxl_node_bitmap_alloc(libxl_ctx *ctx, libxl_bitmap *nodemap, int max_nodes); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index b756b70..172d89d 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -2221,7 +2221,7 @@ start: } else { libxl_bitmap_set_any(&vcpu_cpumap); } - if (libxl_set_vcpuaffinity(ctx, domid, i, &vcpu_cpumap)) { + if (libxl_set_vcpuaffinity(ctx, domid, i, &vcpu_cpumap, NULL)) { fprintf(stderr, "setting affinity failed on vcpu `%d''.\n", i); libxl_bitmap_dispose(&vcpu_cpumap); free(vcpu_to_pcpu); @@ -4629,7 +4629,7 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) } if (vcpuid != -1) { - if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) { + if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap, NULL)) { fprintf(stderr, "Could not set affinity for vcpu `%u''.\n", vcpuid); goto out; } @@ -4641,7 +4641,7 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) } for (i = 0; i < nb_vcpu; i++) { if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid, - &cpumap) == -1) { + &cpumap, NULL)) { fprintf(stderr, "libxl_set_vcpuaffinity failed" " on vcpu `%u''.\n", vcpuinfo[i].vcpuid); }
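As a usage sketch of the libxl API above (not from the series; ctx,
domid, hardmap and softmap are assumed to be already set up): a
new-style caller passes two bitmaps and may leave either one NULL,
while sources built with a pre-4.4 LIBXL_API_VERSION keep compiling
against the single-cpumap prototype through the wrappers shown above:

    /* Change only the hard affinity of vcpu 2; soft is untouched. */
    if (libxl_set_vcpuaffinity(ctx, domid, 2, &hardmap, NULL))
        fprintf(stderr, "setting hard affinity failed\n");

    /* Change only the soft affinity of vcpu 2; hard is untouched. */
    if (libxl_set_vcpuaffinity(ctx, domid, 2, NULL, &softmap))
        fprintf(stderr, "setting soft affinity failed\n");

    /* Code compiled with -DLIBXL_API_VERSION=0x040200 still uses the
     * old prototype; the compatibility wrapper forwards (cpumap, NULL),
     * i.e. it keeps affecting hard affinity only. */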
Getting the soft affinity happens via `xl vcpu-list'', which now looks
like this:

 # xl vcpu-list -s
 Name                                ID  VCPU   CPU State   Time(s) Affinity (Hard / Soft)
 Domain-0                             0     0    11   -b-       5.4  8-15 / all
 Domain-0                             0     1    11   -b-       1.0  8-15 / all
 Domain-0                             0    14    13   -b-       1.4  8-15 / all
 Domain-0                             0    15     8   -b-       1.6  8-15 / all
 vm-test                              3     0     4   -b-       2.5  0-12 / 0-7
 vm-test                              3     1     0   -b-       3.2  0-12 / 0-7

Setting happens by adding a ''-s''/''--soft'' switch to `xl vcpu-pin''.

xl manual page is updated accordingly.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v4:
 * fixed and rephrased the manual entry, as suggested during review;
 * more refactoring to remove some leftover special casing, as suggested
   during review.

Changes from v3:
 * fixed typos in doc, rephrased the help message and changed the title
   of the column for hard/soft affinity, as suggested during review.

Changes from v2:
 * this patch folds what in v2 were patches 13 and 14;
 * `xl vcpu-pin'' always shows both hard and soft affinity, without the
   need of passing ''-s''.
---
 docs/man/xl.pod.1         |   20 ++++++++++++++-
 tools/libxl/xl_cmdimpl.c  |   61 ++++++++++++++++++++++---------------------
 tools/libxl/xl_cmdtable.c |    3 +-
 3 files changed, 54 insertions(+), 30 deletions(-)

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
index e7b9de2..2769318 100644
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -619,7 +619,7 @@ after B<vcpu-set>, go to B<SEE ALSO> section for information.
 Lists VCPU information for a specific domain. If no domain is
 specified, VCPU information for all domains will be provided.
 
-=item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus>
+=item B<vcpu-pin> [I<OPTIONS>] I<domain-id> I<vcpu> I<cpus>
 
 Pins the VCPU to only run on the specific CPUs. The keyword B<all>
 can be used to apply the I<cpus> list to all VCPUs in the
@@ -630,6 +630,24 @@ different run state is appropriate. Pinning can be used to restrict
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+B<OPTIONS>
+
+=over 4
+
+=item B<-s>, B<--soft>
+
+The same as above, but affects I<soft affinity> rather than pinning
+(also called I<hard affinity>).
+
+Normally VCPUs can float between a set of CPUs (subject to pinning).
+Soft affinity offers a means to specify one or more I<preferred>
+CPUs where the VCPU will prefer to run whenever possible.
+
+This feature needs special support in the scheduler, which is only
+provided in credit1.
+
+=back
+
 =item B<vm-list>
 
 Prints information about guests.
This list excludes information about diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 172d89d..ed4801a 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -4513,8 +4513,10 @@ static void print_vcpuinfo(uint32_t tdomid, } /* TIM */ printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); - /* CPU AFFINITY */ + /* CPU HARD AND SOFT AFFINITY */ print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); + printf(" / "); + print_bitmap(vcpuinfo->cpumap_soft.map, nr_cpus, stdout); printf("\n"); } @@ -4549,7 +4551,8 @@ static void vcpulist(int argc, char **argv) } printf("%-32s %5s %5s %5s %5s %9s %s\n", - "Name", "ID", "VCPU", "CPU", "State", "Time(s)", "CPU Affinity"); + "Name", "ID", "VCPU", "CPU", "State", "Time(s)", + "Affinity (Hard / Soft)"); if (!argc) { if (!(dominfo = libxl_list_domain(ctx, &nb_domain))) { fprintf(stderr, "libxl_list_domain failed.\n"); @@ -4582,17 +4585,34 @@ int main_vcpulist(int argc, char **argv) return 0; } -static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) +int main_vcpupin(int argc, char **argv) { libxl_vcpuinfo *vcpuinfo; libxl_bitmap cpumap; - - uint32_t vcpuid; - char *endptr; - int i, nb_cpu, nb_vcpu, rc = -1; + uint32_t vcpuid, domid; + const char *vcpu; + char *endptr, *cpu; + int opt, nb_cpu, nb_vcpu, rc = -1; + libxl_bitmap *soft = NULL, *hard = &cpumap; + static struct option opts[] = { + {"soft", 0, 0, ''s''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; libxl_bitmap_init(&cpumap); + SWITCH_FOREACH_OPT(opt, "s", opts, "vcpu-pin", 3) { + case ''s'': + soft = &cpumap; + hard = NULL; + break; + } + + domid = find_domain(argv[optind]); + vcpu = argv[optind+1]; + cpu = argv[optind+2]; + vcpuid = strtoul(vcpu, &endptr, 10); if (vcpu == endptr) { if (strcmp(vcpu, "all")) { @@ -4629,23 +4649,19 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) } if (vcpuid != -1) { - if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap, NULL)) { - fprintf(stderr, "Could not set affinity for vcpu `%u''.\n", vcpuid); + if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, hard, soft)) { + fprintf(stderr, "Could not set affinity for vcpu `%u''.\n", + vcpuid); goto out; } } else { - if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) { + if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &nb_cpu))) { fprintf(stderr, "libxl_list_vcpu failed.\n"); goto out; } - for (i = 0; i < nb_vcpu; i++) { - if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid, - &cpumap, NULL)) { - fprintf(stderr, "libxl_set_vcpuaffinity failed" - " on vcpu `%u''.\n", vcpuinfo[i].vcpuid); - } - } + if (libxl_set_vcpuaffinity_all(ctx, domid, nb_vcpu, hard, soft)) + fprintf(stderr, "Could not set affinity.\n"); libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } @@ -4655,17 +4671,6 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) return rc; } -int main_vcpupin(int argc, char **argv) -{ - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-pin", 3) { - /* No options */ - } - - return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); -} - static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) { char *endptr; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index ebe0220..8aee343 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -213,7 +213,8 @@ struct cmd_spec cmd_table[] = { { "vcpu-pin", &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", - "<Domain> <VCPU|all> <CPUs|all>", + "[option] <Domain> <VCPU|all> <CPUs|all>", + "-s, --soft Set soft 
affinity", }, { "vcpu-set", &main_vcpuset, 0, 1,
Dario Faggioli
2013-Dec-02 18:29 UTC
[PATCH v5 16/17] xl: enable for specifying node-affinity in the config file
in a similar way to how it is possible to specify vcpu-affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Changes from v4:
 * fix typos and rephrase docs, as suggested during review;
 * more refactoring, i.e., factoring out more of the potentially common
   code, as requested during review.

Changes from v3:
 * fix typos and language issues in docs and comments, as suggested
   during review;
 * common code to soft and hard affinity parsing factored together, as
   requested during review.

Changes from v2:
 * use the new libxl API. Although the implementation changed only a
   little bit, I removed IanJ''s Acked-by, although I am here saying that
   he did provide it, as requested.
---
 docs/man/xl.cfg.pod.5    |   23 ++++-
 tools/libxl/xl_cmdimpl.c |  216 ++++++++++++++++++++++++++++++----------------
 2 files changed, 161 insertions(+), 78 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index 70d6d9f..cce51ae 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -144,19 +144,36 @@ run on cpu #3 of the host.
 =back
 
 If this option is not specified, no vcpu to cpu pinning is established,
-and the vcpus of the guest can run on all the cpus of the host.
+and the vcpus of the guest can run on all the cpus of the host. If this
+option is specified, the intersection of the vcpu pinning mask, provided
+here, and the soft affinity mask, provided via B<cpus\_soft=> (if any),
+is utilized to compute the domain node-affinity, for driving memory
+allocations.
 
 If we are on a NUMA machine (i.e., if the host has more than one NUMA
 node) and this option is not specified, libxl automatically tries to
 place the guest on the least possible number of nodes. That, however,
 will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus, it will just prefer the ones from the node it has been
-placed on. A heuristic approach is used for choosing the best node (or
+all the cpus. A heuristic approach is used for choosing the best node (or
 set of nodes), with the goals of maximizing performance for the guest
 and, at the same time, achieving efficient utilization of host cpus
 and memory. See F<docs/misc/xl-numa-placement.markdown> for more
 details.
 
+=item B<cpus_soft="CPU-LIST">
+
+Exactly as B<cpus=>, but specifies soft affinity, rather than pinning
+(hard affinity). When using the credit scheduler, this means what cpus
+the vcpus of the domain prefer.
+
+A C<CPU-LIST> is specified exactly as above, for B<cpus=>.
+
+If this option is not specified, the vcpus of the guest will not have
+any preference regarding on what cpu to run. If this option is specified,
+the intersection of the soft affinity mask, provided here, and the vcpu
+pinning, provided via B<cpus=> (if any), is utilized to compute the
+domain node-affinity, for driving memory allocations.
+ =back =head3 CPU Scheduling diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index ed4801a..768b6da 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -76,8 +76,9 @@ xlchild children[child_max]; static const char *common_domname; static int fd_lock = -1; -/* Stash for specific vcpu to pcpu mappping */ +/* Stash for specific vcpu to pcpu hard and soft mapping */ static int *vcpu_to_pcpu; +static int *vcpu_to_pcpu_soft; static const char savefileheader_magic[32] "Xen saved domain, xl format\n \0 \r"; @@ -686,6 +687,92 @@ static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) return rc; } +static int *parse_config_cpumap_list(XLU_ConfigList *cpus, + libxl_bitmap *cpumap, + int max_vcpus) +{ + int i, n_cpus = 0; + int *to_pcpu; + const char *buf; + + if (libxl_cpu_bitmap_alloc(ctx, cpumap, 0)) { + fprintf(stderr, "Unable to allocate cpumap\n"); + exit(1); + } + + /* Prepare the array for single vcpu to pcpu mappings */ + to_pcpu = xmalloc(sizeof(int) * max_vcpus); + memset(to_pcpu, -1, sizeof(int) * max_vcpus); + + /* + * Idea here is to let libxl think all the domain''s vcpus + * have cpu affinity with all the pcpus on the list. Doing + * that ensures memory is allocated on the proper NUMA nodes. + * It is then us, here in xl, that matches each single vcpu + * to its pcpu (and that''s why we need to stash such info in + * the to_pcpu array now) after the domain has been created. + * This way, we avoid having to pass to libxl some big array + * hosting the single mappings. + */ + libxl_bitmap_set_none(cpumap); + while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) { + i = atoi(buf); + if (!libxl_bitmap_cpu_valid(cpumap, i)) { + fprintf(stderr, "cpu %d illegal\n", i); + exit(1); + } + libxl_bitmap_set(cpumap, i); + if (n_cpus < max_vcpus) + to_pcpu[n_cpus] = i; + n_cpus++; + } + + return to_pcpu; +} + +static void parse_config_cpumap_string(const char *buf, libxl_bitmap *cpumap) +{ + char *buf2 = strdup(buf); + + if (libxl_cpu_bitmap_alloc(ctx, cpumap, 0)) { + fprintf(stderr, "Unable to allocate cpumap\n"); + exit(1); + } + + libxl_bitmap_set_none(cpumap); + if (vcpupin_parse(buf2, cpumap)) + exit(1); + free(buf2); +} + +static void parse_cpu_affinity(XLU_Config *config, const char *what, + libxl_domain_build_info *b_info) +{ + XLU_ConfigList *cpus; + const char *buf; + libxl_bitmap *map; + int **array; + + if (!strcmp(what, "cpus")) { + map = &b_info->cpumap; + array = &vcpu_to_pcpu; + } else if (!strcmp(what, "cpus_soft")) { + map = &b_info->cpumap_soft; + array = &vcpu_to_pcpu_soft; + } else + return; + + if (!xlu_cfg_get_list (config, what, &cpus, 0, 1)) + *array = parse_config_cpumap_list(cpus, map, b_info->max_vcpus); + else if (!xlu_cfg_get_string (config, what, &buf, 0)) + parse_config_cpumap_string(buf, map); + else + return; + + /* We have an hard and/or soft affinity: disable automatic placement */ + libxl_defbool_set(&b_info->numa_placement, false); +} + static void parse_config_data(const char *config_source, const char *config_data, int config_len, @@ -696,7 +783,8 @@ static void parse_config_data(const char *config_source, const char *buf; long l; XLU_Config *config; - XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; + XLU_ConfigList *vbds, *nics, *pcis; + XLU_ConfigList *cvfbs, *cpuids, *vtpms; XLU_ConfigList *ioports, *irqs, *iomem; int num_ioports, num_irqs, num_iomem; int pci_power_mgmt = 0; @@ -818,60 +906,9 @@ static void parse_config_data(const char *config_source, if (!xlu_cfg_get_long (config, "maxvcpus", 
&l, 0)) b_info->max_vcpus = l; - if (!xlu_cfg_get_list (config, "cpus", &cpus, 0, 1)) { - int n_cpus = 0; - if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap, 0)) { - fprintf(stderr, "Unable to allocate cpumap\n"); - exit(1); - } - - /* Prepare the array for single vcpu to pcpu mappings */ - vcpu_to_pcpu = xmalloc(sizeof(int) * b_info->max_vcpus); - memset(vcpu_to_pcpu, -1, sizeof(int) * b_info->max_vcpus); - - /* - * Idea here is to let libxl think all the domain''s vcpus - * have cpu affinity with all the pcpus on the list. - * It is then us, here in xl, that matches each single vcpu - * to its pcpu (and that''s why we need to stash such info in - * the vcpu_to_pcpu array now) after the domain has been created. - * Doing it like this saves the burden of passing to libxl - * some big array hosting the single mappings. Also, using - * the cpumap derived from the list ensures memory is being - * allocated on the proper nodes anyway. - */ - libxl_bitmap_set_none(&b_info->cpumap); - while ((buf = xlu_cfg_get_listitem(cpus, n_cpus)) != NULL) { - i = atoi(buf); - if (!libxl_bitmap_cpu_valid(&b_info->cpumap, i)) { - fprintf(stderr, "cpu %d illegal\n", i); - exit(1); - } - libxl_bitmap_set(&b_info->cpumap, i); - if (n_cpus < b_info->max_vcpus) - vcpu_to_pcpu[n_cpus] = i; - n_cpus++; - } - - /* We have a cpumap, disable automatic placement */ - libxl_defbool_set(&b_info->numa_placement, false); - } - else if (!xlu_cfg_get_string (config, "cpus", &buf, 0)) { - char *buf2 = strdup(buf); - - if (libxl_cpu_bitmap_alloc(ctx, &b_info->cpumap, 0)) { - fprintf(stderr, "Unable to allocate cpumap\n"); - exit(1); - } - - libxl_bitmap_set_none(&b_info->cpumap); - if (vcpupin_parse(buf2, &b_info->cpumap)) - exit(1); - free(buf2); - - libxl_defbool_set(&b_info->numa_placement, false); - } + parse_cpu_affinity(config, "cpus", b_info); + parse_cpu_affinity(config, "cpus_soft", b_info); if (!xlu_cfg_get_long (config, "memory", &l, 0)) { b_info->max_memkb = l * 1024; @@ -1990,6 +2027,40 @@ static void evdisable_disk_ejects(libxl_evgen_disk_eject **diskws, } } +static inline int set_vcpu_to_pcpu_affinity(uint32_t domid, int *to_pcpu, + int max_vcpus, int soft) +{ + libxl_bitmap vcpu_cpumap; + libxl_bitmap *softmap = NULL, *hardmap = NULL; + int i, ret = 0; + + ret = libxl_cpu_bitmap_alloc(ctx, &vcpu_cpumap, 0); + if (ret) + return -1; + + if (soft) + softmap = &vcpu_cpumap; + else + hardmap = &vcpu_cpumap; + + for (i = 0; i < max_vcpus; i++) { + if (to_pcpu[i] != -1) { + libxl_bitmap_set_none(&vcpu_cpumap); + libxl_bitmap_set(&vcpu_cpumap, to_pcpu[i]); + } else { + libxl_bitmap_set_any(&vcpu_cpumap); + } + if (libxl_set_vcpuaffinity(ctx, domid, i, hardmap, softmap)) { + fprintf(stderr, "setting affinity failed on vcpu `%d''.\n", i); + ret = -1; + break; + } + } + libxl_bitmap_dispose(&vcpu_cpumap); + + return ret; +} + static uint32_t create_domain(struct domain_create *dom_info) { uint32_t domid = INVALID_DOMID; @@ -2206,31 +2277,26 @@ start: if ( ret ) goto error_out; - /* If single vcpu to pcpu mapping was requested, honour it */ + /* If single vcpu pinning or soft affinity was requested, honour it */ if (vcpu_to_pcpu) { - libxl_bitmap vcpu_cpumap; + ret = set_vcpu_to_pcpu_affinity(domid, vcpu_to_pcpu, + d_config.b_info.max_vcpus, 0); + free(vcpu_to_pcpu); - ret = libxl_cpu_bitmap_alloc(ctx, &vcpu_cpumap, 0); if (ret) goto error_out; - for (i = 0; i < d_config.b_info.max_vcpus; i++) { - if (vcpu_to_pcpu[i] != -1) { - libxl_bitmap_set_none(&vcpu_cpumap); - libxl_bitmap_set(&vcpu_cpumap, vcpu_to_pcpu[i]); - } else { 
- libxl_bitmap_set_any(&vcpu_cpumap); - } - if (libxl_set_vcpuaffinity(ctx, domid, i, &vcpu_cpumap, NULL)) { - fprintf(stderr, "setting affinity failed on vcpu `%d''.\n", i); - libxl_bitmap_dispose(&vcpu_cpumap); - free(vcpu_to_pcpu); - ret = ERROR_FAIL; - goto error_out; - } - } - libxl_bitmap_dispose(&vcpu_cpumap); - free(vcpu_to_pcpu); vcpu_to_pcpu = NULL; + vcpu_to_pcpu = NULL; + } + if (vcpu_to_pcpu_soft) { + ret = set_vcpu_to_pcpu_affinity(domid, vcpu_to_pcpu_soft, + d_config.b_info.max_vcpus, 1); + free(vcpu_to_pcpu_soft); + + if (ret) + goto error_out; + + vcpu_to_pcpu_soft = NULL; } ret = libxl_userdata_store(ctx, domid, "xl",
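Putting the two config options together, a hypothetical guest config
fragment (values chosen to match the examples above) and its effect:

    # vcpus may run on pcpus 0-12 (pinning/hard affinity) and prefer
    # pcpus 0-7 (soft affinity). The intersection, 0-7, is what drives
    # the domain''s node-affinity and hence its memory allocations.
    vcpus     = 2
    cpus      = "0-12"
    cpus_soft = "0-7"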
Dario Faggioli
2013-Dec-02 18:29 UTC
[PATCH v5 17/17] libxl: automatic NUMA placement affects soft affinity
vCPU soft affinity and NUMA-aware scheduling do not have to be related.
However, soft affinity is how NUMA-aware scheduling is actually
implemented, and therefore, by default, the results of automatic
NUMA placement (at VM creation time) are also used to set the soft
affinity of all the vCPUs of the domain.

Of course, this only happens if automatic NUMA placement is enabled
and actually takes place (for instance, if the user does not specify
any hard or soft affinity in the xl config file).

This also takes care of the vice versa, i.e., automatic placement is
not triggered if the config file specifies either a hard (the check
for which was already there) or a soft (the check for which is
introduced by this commit) affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v3:
 * rephrase comments and docs, as suggested during review.
---
 docs/man/xl.cfg.pod.5                |   21 +++++++++++----------
 docs/misc/xl-numa-placement.markdown |   14 ++++++++++++--
 tools/libxl/libxl_dom.c              |   20 ++++++++++++++++++--
 3 files changed, 41 insertions(+), 14 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index cce51ae..99a43a7 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -150,16 +150,6 @@ here, and the soft affinity mask, provided via B<cpus\_soft=> (if any),
 is utilized to compute the domain node-affinity, for driving memory
 allocations.
 
-If we are on a NUMA machine (i.e., if the host has more than one NUMA
-node) and this option is not specified, libxl automatically tries to
-place the guest on the least possible number of nodes. That, however,
-will not affect vcpu pinning, so the guest will still be able to run on
-all the cpus. A heuristic approach is used for choosing the best node (or
-set of nodes), with the goals of maximizing performance for the guest
-and, at the same time, achieving efficient utilization of host cpus
-and memory. See F<docs/misc/xl-numa-placement.markdown> for more
-details.
-
 =item B<cpus_soft="CPU-LIST">
 
 Exactly as B<cpus=>, but specifies soft affinity, rather than pinning
@@ -174,6 +164,17 @@ the intersection of the soft affinity mask, provided here, and the vcpu
 pinning, provided via B<cpus=> (if any), is utilized to compute the
 domain node-affinity, for driving memory allocations.
 
+If this option is not specified (and B<cpus=> is not specified either),
+libxl automatically tries to place the guest on the least possible
+number of nodes. A heuristic approach is used for choosing the best
+node (or set of nodes), with the goal of maximizing performance for
+the guest and, at the same time, achieving efficient utilization of
+host cpus and memory. In that case, the soft affinity of all the vcpus
+of the domain will be set to the pcpus belonging to the NUMA nodes
+chosen during placement.
+
+For more details, see F<docs/misc/xl-numa-placement.markdown>.
+
 =back
 
 =head3 CPU Scheduling
diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
index b1ed361..09ae95e 100644
--- a/docs/misc/xl-numa-placement.markdown
+++ b/docs/misc/xl-numa-placement.markdown
@@ -126,10 +126,20 @@ or Xen won''t be able to guarantee the locality for their memory
 accesses. That, of course, also mean the vCPUs of the domain will
 only be able to execute on those same pCPUs.
 
+It is also possible to have a "cpus\_soft=" option in the xl config file,
+to specify the soft affinity for all the vCPUs of the domain. This affects
+the NUMA placement in the following way:
+
+ * if only "cpus\_soft=" is present, the VM''s node-affinity will be equal
+   to the nodes to which the pCPUs in the soft affinity mask belong;
+ * if both "cpus\_soft=" and "cpus=" are present, the VM''s node-affinity
+   will be equal to the nodes to which the pCPUs present both in hard and
+   soft affinity belong.
+
 ### Placing the guest automatically ###
 
-If no "cpus=" option is specified in the config file, libxl tries
-to figure out on its own on which node(s) the domain could fit best.
+If neither "cpus=" nor "cpus\_soft=" are present in the config file, libxl
+tries to figure out on its own on which node(s) the domain could fit best.
 If it finds one (some), the domain''s node affinity get set to there,
 and both memory allocations and NUMA aware scheduling (for the credit
 scheduler and starting from Xen 4.3) will comply with it. Starting from
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 4bfed60..ceb8643 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -222,18 +222,34 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid,
      * some weird error manifests) the subsequent call to
      * libxl_domain_set_nodeaffinity() will do the actual placement,
      * whatever that turns out to be.
+     *
+     * As far as scheduling is concerned, we achieve NUMA-aware scheduling
+     * by having the results of placement affect the soft affinity of all
+     * the vcpus of the domain. Of course, we want that iff placement is
+     * enabled and actually happens, so we only change info->cpumap_soft to
+     * reflect the placement result if that is the case.
      */
     if (libxl_defbool_val(info->numa_placement)) {
 
-        if (!libxl_bitmap_is_full(&info->cpumap)) {
+        /* We require both hard and soft affinity not to be set */
+        if (!libxl_bitmap_is_full(&info->cpumap) ||
+            !libxl_bitmap_is_full(&info->cpumap_soft)) {
             LOG(ERROR, "Can run NUMA placement only if no vcpu "
-                       "affinity is specified");
+                       "(hard or soft) affinity is specified");
             return ERROR_INVAL;
         }
 
         rc = numa_place_domain(gc, domid, info);
         if (rc)
             return rc;
+
+        /*
+         * We change the soft affinity in domain_build_info here, of course
+         * after converting the result of placement from nodes to cpus. The
+         * following call to libxl_set_vcpuaffinity_all() will do the
+         * actual updating of the domain''s vcpus'' soft affinity.
+         */
+        libxl_nodemap_to_cpumap(ctx, &info->nodemap, &info->cpumap_soft);
     }
 
     libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap,
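A worked example of the node-affinity derivation described in the
markdown above (topology invented): on a host where node0 spans pcpus
0-3 and node1 spans pcpus 4-7,

    cpus      = "0-5"    ->  hard affinity = {0-5}
    cpus_soft = "4-7"    ->  soft affinity = {4-7}
                             intersection  = {4,5}
                             node-affinity = {node1}

so the domain''s memory is preferably allocated on node1, and the credit1
scheduler will prefer pcpus 4 and 5 for its vcpus.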
Jan Beulich
2013-Dec-03 10:02 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote: > +static inline > +int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff) > +{ > + return vcpuaff->flags == 0 || > + (vcpuaff->flags & XEN_VCPUAFFINITY_HARD && > + guest_handle_is_null(vcpuaff->cpumap_hard.bitmap)) || > + (vcpuaff->flags & XEN_VCPUAFFINITY_SOFT &&The binary &-s clearly want to be parenthesized.> @@ -605,31 +615,112 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) > case XEN_DOMCTL_getvcpuaffinity: > { > struct vcpu *v; > + xen_domctl_vcpuaffinity_t *vcpuaff = &op->u.vcpuaffinity; > > ret = -EINVAL; > - if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus ) > + if ( vcpuaff->vcpu >= d->max_vcpus ) > break; > > ret = -ESRCH; > - if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL ) > + if ( (v = d->vcpu[vcpuaff->vcpu]) == NULL ) > break; > > if ( op->cmd == XEN_DOMCTL_setvcpuaffinity ) > { > - cpumask_var_t new_affinity; > + cpumask_var_t new_affinity, old_affinity; > + cpumask_t *online = cpupool_online_cpumask(v->domain->cpupool);; > + > + /* > + * We want to be able to restore hard affinity if we are trying > + * setting both and changing soft affinity (which happens later, > + * when hard affinity has been succesfully chaged already) fails. > + */ > + if ( !alloc_cpumask_var(&old_affinity) ) > + { > + ret = -ENOMEM; > + break; > + } > + cpumask_copy(old_affinity, v->cpu_hard_affinity); > > - ret = xenctl_bitmap_to_cpumask( > - &new_affinity, &op->u.vcpuaffinity.cpumap); > - if ( !ret ) > + if ( !alloc_cpumask_var(&new_affinity) ) > { > - ret = vcpu_set_affinity(v, new_affinity); > - free_cpumask_var(new_affinity); > + free_cpumask_var(old_affinity); > + ret = -ENOMEM; > + break; > } > + > + ret = -EINVAL; > + if (vcpuaffinity_params_invalid(vcpuaff))Coding style. Further - can''t this be done in the code shared between "get" and "set"?> + goto setvcpuaffinity_out; > + > + /* > + * We both set a new affinity and report back to the caller what > + * the scheduler will be effectively using. > + */ > + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) > + { > + ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), > + &vcpuaff->cpumap_hard, > + vcpuaff->cpumap_hard.nr_bits);There''s no code above range checking vcpuaff->cpumap_hard.nr_bits, yet xenctl_bitmap_to_bitmap() uses the passed in value to write into the array pointed to by the first argument. Why is this not xenctl_bitmap_to_cpumask() in the first place?> + if ( !ret ) > + ret = vcpu_set_hard_affinity(v, new_affinity); > + if ( ret ) > + goto setvcpuaffinity_out; > + > + /* > + * For hard affinity, what we return is the intersection of > + * cpupool''s online mask and the new hard affinity. > + */ > + cpumask_and(new_affinity, online, v->cpu_hard_affinity); > + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard, > + new_affinity);So what you return back here ...> + } > + if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT ) > + { > + ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), > + &vcpuaff->cpumap_soft, > + vcpuaff->cpumap_soft.nr_bits); > + if ( !ret) > + ret = vcpu_set_soft_affinity(v, new_affinity); > + if ( ret ) > + { > + /* > + * Since we''re returning error, the caller expects nothing > + * happened, so we rollback the changes to hard affinity > + * (if any). 
> + */ > + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) > + vcpu_set_hard_affinity(v, old_affinity); > + goto setvcpuaffinity_out; > + } > + > + /* > + * For soft affinity, we return the intersection between the > + * new soft affinity, the cpupool''s online map and the (new) > + * hard affinity. > + */ > + cpumask_and(new_affinity, new_affinity, online); > + cpumask_and(new_affinity, new_affinity, v->cpu_hard_affinity); > + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft, > + new_affinity);... and here differs from ...> + } > + > + setvcpuaffinity_out: > + free_cpumask_var(new_affinity); > + free_cpumask_var(old_affinity); > } > else > { > - ret = cpumask_to_xenctl_bitmap( > - &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity); > + ret = -EINVAL; > + if (vcpuaffinity_params_invalid(vcpuaff)) > + break; > + > + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) > + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard, > + v->cpu_hard_affinity); > + if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT ) > + ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft, > + v->cpu_soft_affinity);... the simple "get". Why? And if really intended, this should be mentioned in the public header. Jan
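To make the flaw being pointed out concrete, here is a sketch (not a
proposed patch) of the two call patterns. The destination bitmap is
sized for nr_cpu_ids bits, while nr_bits arrives from the hypercall
argument and is never validated:

    /* Unsafe: nr_bits is caller-controlled, so a buggy or malicious
     * toolstack can make the copy write past the fixed-size cpumask. */
    ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
                                  &vcpuaff->cpumap_hard,
                                  vcpuaff->cpumap_hard.nr_bits);

    /* Safe: clamp to the size the destination was allocated for,
     * which is what xenctl_bitmap_to_cpumask() does internally. */
    ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
                                  &vcpuaff->cpumap_hard,
                                  nr_cpu_ids);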
Jan Beulich
2013-Dec-03 10:06 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
>>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote: >>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> + goto setvcpuaffinity_out; >> + >> + /* >> + * We both set a new affinity and report back to the caller what >> + * the scheduler will be effectively using. >> + */ >> + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) >> + { >> + ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), >> + &vcpuaff->cpumap_hard, >> + vcpuaff->cpumap_hard.nr_bits); > > There''s no code above range checking vcpuaff->cpumap_hard.nr_bits, > yet xenctl_bitmap_to_bitmap() uses the passed in value to write into > the array pointed to by the first argument. Why is this not > xenctl_bitmap_to_cpumask() in the first place?And just to make it explicit - with fundamental flaws like this, I''m not certain anymore whether we really ought to rush this series in for 4.4. Jan
Dario Faggioli
2013-Dec-03 10:59 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 10:02 +0000, Jan Beulich wrote:
> >>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > +static inline
> > +int vcpuaffinity_params_invalid(const xen_domctl_vcpuaffinity_t *vcpuaff)
> > +{
> > +    return vcpuaff->flags == 0 ||
> > +           (vcpuaff->flags & XEN_VCPUAFFINITY_HARD &&
> > +            guest_handle_is_null(vcpuaff->cpumap_hard.bitmap)) ||
> > +           (vcpuaff->flags & XEN_VCPUAFFINITY_SOFT &&
> 
> The binary &-s clearly want to be parenthesized.
> 
Ok.

> > @@ -605,31 +615,112 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
> >      case XEN_DOMCTL_getvcpuaffinity:
> >      {
> >          struct vcpu *v;
> > +        xen_domctl_vcpuaffinity_t *vcpuaff = &op->u.vcpuaffinity;
> > 
> >          ret = -EINVAL;
> > -        if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus )
> > +        if ( vcpuaff->vcpu >= d->max_vcpus )
> >              break;
> > 
> >          ret = -ESRCH;
> > -        if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL )
> > +        if ( (v = d->vcpu[vcpuaff->vcpu]) == NULL )
> >              break;
> > 
> >          if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
> >          {
> > -            cpumask_var_t new_affinity;
> > +            cpumask_var_t new_affinity, old_affinity;
> > +            cpumask_t *online = cpupool_online_cpumask(v->domain->cpupool);;
> > +
> > +            /*
> > +             * We want to be able to restore hard affinity if we are trying
> > +             * setting both and changing soft affinity (which happens later,
> > +             * when hard affinity has been succesfully chaged already) fails.
> > +             */
> > +            if ( !alloc_cpumask_var(&old_affinity) )
> > +            {
> > +                ret = -ENOMEM;
> > +                break;
> > +            }
> > +            cpumask_copy(old_affinity, v->cpu_hard_affinity);
> > 
> > -            ret = xenctl_bitmap_to_cpumask(
> > -                &new_affinity, &op->u.vcpuaffinity.cpumap);
> > -            if ( !ret )
> > +            if ( !alloc_cpumask_var(&new_affinity) )
> >              {
> > -                ret = vcpu_set_affinity(v, new_affinity);
> > -                free_cpumask_var(new_affinity);
> > +                free_cpumask_var(old_affinity);
> > +                ret = -ENOMEM;
> > +                break;
> >              }
> > +
> > +            ret = -EINVAL;
> > +            if (vcpuaffinity_params_invalid(vcpuaff))
> 
> Coding style.
> 
Ah, sure, sorry.

> Further - can''t this be done in the code shared between "get" and
> "set"?
> 
It certainly can be, yes.

> > +            goto setvcpuaffinity_out;
> > +
> > +            /*
> > +             * We both set a new affinity and report back to the caller what
> > +             * the scheduler will be effectively using.
> > +             */
> > +            if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
> > +            {
> > +                ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
> > +                                              &vcpuaff->cpumap_hard,
> > +                                              vcpuaff->cpumap_hard.nr_bits);
> 
> There''s no code above range checking vcpuaff->cpumap_hard.nr_bits,
> yet xenctl_bitmap_to_bitmap() uses the passed in value to write into
> the array pointed to by the first argument. Why is this not
> xenctl_bitmap_to_cpumask() in the first place?
> 
Because xenctl_bitmap_to_cpumask() allocates the cpumask, which I don''t
want, since I already did that above.

What xenctl_bitmap_to_cpumask() does is allocate the cpumask and then
call xenctl_bitmap_to_bitmap(), although, yes, it uses nr_cpu_ids for
nbits, not what comes from outside, which I overlooked while adapting
this to the new design we agreed upon.

Thanks for pointing this out, I''ll fix it.

> > +                if ( !ret )
> > +                    ret = vcpu_set_hard_affinity(v, new_affinity);
> > +                if ( ret )
> > +                    goto setvcpuaffinity_out;
> > +
> > +                /*
> > +                 * For hard affinity, what we return is the intersection of
> > +                 * cpupool''s online mask and the new hard affinity.
> > +                 */
> > +                cpumask_and(new_affinity, online, v->cpu_hard_affinity);
> > +                ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard,
> > +                                               new_affinity);
> 
> So what you return back here ...
> 
> > +            }
> > +            if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT )
> > +            {
> > +                ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
> > +                                              &vcpuaff->cpumap_soft,
> > +                                              vcpuaff->cpumap_soft.nr_bits);
> > +                if ( !ret)
> > +                    ret = vcpu_set_soft_affinity(v, new_affinity);
> > +                if ( ret )
> > +                {
> > +                    /*
> > +                     * Since we''re returning error, the caller expects nothing
> > +                     * happened, so we rollback the changes to hard affinity
> > +                     * (if any).
> > +                     */
> > +                    if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
> > +                        vcpu_set_hard_affinity(v, old_affinity);
> > +                    goto setvcpuaffinity_out;
> > +                }
> > +
> > +                /*
> > +                 * For soft affinity, we return the intersection between the
> > +                 * new soft affinity, the cpupool''s online map and the (new)
> > +                 * hard affinity.
> > +                 */
> > +                cpumask_and(new_affinity, new_affinity, online);
> > +                cpumask_and(new_affinity, new_affinity, v->cpu_hard_affinity);
> > +                ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft,
> > +                                               new_affinity);
> 
> ... and here differs from ...
> 
> > +            }
> > +
> > +            setvcpuaffinity_out:
> > +            free_cpumask_var(new_affinity);
> > +            free_cpumask_var(old_affinity);
> >          }
> >          else
> >          {
> > -            ret = cpumask_to_xenctl_bitmap(
> > -                &op->u.vcpuaffinity.cpumap, v->cpu_hard_affinity);
> > +            ret = -EINVAL;
> > +            if (vcpuaffinity_params_invalid(vcpuaff))
> > +                break;
> > +
> > +            if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
> > +                ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_hard,
> > +                                               v->cpu_hard_affinity);
> > +            if ( vcpuaff->flags & XEN_VCPUAFFINITY_SOFT )
> > +                ret = cpumask_to_xenctl_bitmap(&vcpuaff->cpumap_soft,
> > +                                               v->cpu_soft_affinity);
> 
> ... the simple "get". Why? And if really intended, this should be
> mentioned in the public header.
> 
The main reason is that I asked you what you think I should be
returning (in setvcpuaffinity) and that''s what you told me. :-)
http://bugs.xenproject.org/xen/mid/%3C52932DCB020000780010673E@nat28.tlf.novell.com%3E

Well, actually, I only asked what we should be returning from
DOMCTL_vcpu_setaffinity, and never mentioned DOMCTL_vcpu_getaffinity
because I seriously thought there was no need to discuss what the latter
should be returning, and whether or not the two return values should
match.

I thought, and still think, it''s pretty obvious there must be a way for
the user to retrieve what he has set, independently of the various
intersections between online, hard affinity, etc, mustn''t it? Well, that
way is using DOMCTL_get_vcpuaffinity.

Actually, if DOMCTL_vcpu_setaffinity and DOMCTL_vcpu_getaffinity returned
the exact same thing, why would we bother having both do that? It would
have been pretty easy to just, in the toolstack, issue a call to _get_
right after a _set_ and check. No, the point was to have _set_ return
something actually useful, i.e., something that can''t be retrieved in
any other way, as an indication of how effective the set operation
itself was. That, by definition, makes what _set_ and _get_ return
different, and it may be my fault not to have mentioned this clearly in
our earlier discussions, but I really was taking it for granted. :-/

That being said, I of course can say something about this in some
docs/header. Actually, the following patch does that quite thoroughly, I
think, in tools/libxc/xenctrl.h. I can certainly add something similar
in Xen''s public .h files.
Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
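A concrete illustration of the set/get asymmetry being argued here
(numbers invented): take a vcpu in a cpupool whose online pCPUs are
{0,1}, and a hard map prepared by the caller to contain {0-3}:

    /* Hypothetical sequence against the new interface. */
    xc_vcpu_setaffinity(xch, domid, 0, hard, NULL,
                        XEN_VCPUAFFINITY_HARD);
    /* hard now holds the *effective* mask: {0,1}, i.e. the
     * intersection with the cpupool''s online cpus. */

    xc_vcpu_getaffinity(xch, domid, 0, hard, NULL,
                        XEN_VCPUAFFINITY_HARD);
    /* hard now holds what was stored, i.e. what the caller set:
     * {0-3}, independently of which pcpus are online. */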
Dario Faggioli
2013-Dec-03 11:08 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 10:06 +0000, Jan Beulich wrote:> >>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote: > >>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote: > >> + goto setvcpuaffinity_out; > >> + > >> + /* > >> + * We both set a new affinity and report back to the caller what > >> + * the scheduler will be effectively using. > >> + */ > >> + if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD ) > >> + { > >> + ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), > >> + &vcpuaff->cpumap_hard, > >> + vcpuaff->cpumap_hard.nr_bits); > > > > There''s no code above range checking vcpuaff->cpumap_hard.nr_bits, > > yet xenctl_bitmap_to_bitmap() uses the passed in value to write into > > the array pointed to by the first argument. Why is this not > > xenctl_bitmap_to_cpumask() in the first place? > > And just to make it explicit - with fundamental flaws like this, I''m > not certain anymore whether we really ought to rush this series > in for 4.4. >Well, I certainly have no intention to say that this isn''t something that I overlooked. Just for the sake of completeness, that''s what''s required to fix it: diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 9eecb5e..cd817f3 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -661,7 +661,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), &vcpuaff->cpumap_hard, - vcpuaff->cpumap_hard.nr_bits); + nr_cpu_ids); if ( !ret ) ret = vcpu_set_hard_affinity(v, new_affinity); if ( ret ) @@ -679,7 +679,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity), &vcpuaff->cpumap_soft, - vcpuaff->cpumap_soft.nr_bits); + nr_cpu_ids); if ( !ret) ret = vcpu_set_soft_affinity(v, new_affinity); if ( ret ) That being said, I definitely won''t interfere with the decision of taking it or not for 4.4. Just let me know. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Jan Beulich
2013-Dec-03 11:20 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
>>> On 03.12.13 at 11:59, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Well, actually, I only asked what we should be returning from
> DOMCTL_vcpu_setaffinity, and never mentioned DOMCTL_vcpu_getaffinity
> because I seriously thought there was no need to discuss what the latter
> should be returning, and whether or not the two return values should
> match.

And I would also have doubted that I ever said to do something
intentionally inconsistent...

> I thought, and still think, it''s pretty obvious there must be a way for
> the user to retrieve what he has set, independently of the various
> intersections between online, hard affinity, etc, mustn''t it? Well, that
> way is using DOMCTL_get_vcpuaffinity.
> 
> Actually, if DOMCTL_vcpu_setaffinity and DOMCTL_vcpu_getaffinity returned
> the exact same thing, why would we bother having both do that? It would
> have been pretty easy to just, in the toolstack, issue a call to _get_
> right after a _set_ and check. No, the point was to have _set_ return
> something actually useful, i.e., something that can''t be retrieved in
> any other way, as an indication of how effective the set operation
> itself was. That, by definition, makes what _set_ and _get_ return
> different, and it may be my fault not to have mentioned this clearly in
> our earlier discussions, but I really was taking it for granted. :-/
> 
> That being said, I of course can say something about this in some
> docs/header. Actually, the following patch does that quite thoroughly, I
> think, in tools/libxc/xenctrl.h. I can certainly add something similar
> in Xen''s public .h files.

As said - as long as there is proper reasoning and at least a brief
explanation in the public header (xenctrl.h not being considered
"public" in the context here), I''m fine with the returned masks
differing.

Jan
Dario Faggioli
2013-Dec-03 11:30 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 11:20 +0000, Jan Beulich wrote:
> >>> On 03.12.13 at 11:59, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > Well, actually, I only asked what we should be returning from
> > DOMCTL_vcpu_setaffinity, and never mentioned DOMCTL_vcpu_getaffinity
> > because I seriously thought there was no need to discuss what the latter
> > should be returning, and whether or not the two return values should
> > match.
> 
> And I would also have doubted that I ever said to do something
> intentionally inconsistent...
> 
Well, if you ask me, you never did, as I think this is perfectly
consistent. :-)

> > That being said, I of course can say something about this in some
> > docs/header. Actually, the following patch does that quite thoroughly, I
> > think, in tools/libxc/xenctrl.h. I can certainly add something similar
> > in Xen''s public .h files.
> 
> As said - as long as there is proper reasoning and at least a brief
> explanation in the public header (xenctrl.h not being considered
> "public" in the context here), I''m fine with the returned masks
> differing.
> 
Ok, as I anticipated, I''ll make a note about this in a hypervisor
public header.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2013-Dec-03 13:25 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 12:08 +0100, Dario Faggioli wrote:
> On mar, 2013-12-03 at 10:06 +0000, Jan Beulich wrote:
> > And just to make it explicit - with fundamental flaws like this, I''m
> > not certain anymore whether we really ought to rush this series
> > in for 4.4.
> > 
> Well, I certainly have no intention to say that this isn''t something
> that I overlooked. Just for the sake of completeness, that''s what''s
> required to fix it:
> 
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> index 9eecb5e..cd817f3 100644
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -661,7 +661,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>  {
>                  ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
>                                                &vcpuaff->cpumap_hard,
> -                                              vcpuaff->cpumap_hard.nr_bits);
> +                                              nr_cpu_ids);
>                  if ( !ret )
>                      ret = vcpu_set_hard_affinity(v, new_affinity);
>                  if ( ret )
> @@ -679,7 +679,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl)
>              {
>                  ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
>                                                &vcpuaff->cpumap_soft,
> -                                              vcpuaff->cpumap_soft.nr_bits);
> +                                              nr_cpu_ids);
>                  if ( !ret)
>                      ret = vcpu_set_soft_affinity(v, new_affinity);
>                  if ( ret )
> 
BTW, there''s a v6 with this fixed, as well as other Jan''s (coding style
and explaining interface in public headers) and IanC''s (about code
motion) comments addressed:

git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v6
http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v6

Just waiting a bit more to post it, to see if there are other comments.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
George Dunlap
2013-Dec-03 14:05 UTC
Re: [PATCH v5 00/17] Implement vcpu soft affinity for credit1
On 12/02/2013 06:27 PM, Dario Faggioli wrote:
> Take 5. Main changes from last round are:
>  - changed the Xen and libxc interface, as agreed with Jan during v4 review.
>    Now it is possible to specify and get back both the (effective) hard and
>    soft affinity with just one hypercall.
>  - applied all IanC's comments on the tool patches.
>
> Release wise, George expressed himself in favour of a freeze exception here:
> http://bugs.xenproject.org/xen/mid/<528B8B25.6000608@eu.citrix.com>
>
> The patch series is available in the following git branch:
>
>   git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v5
>
> Thanks and Regards,
> Dario
> ---
>
> Dario Faggioli (17):
>  a   xl: match output of vcpu-list with pinning syntax
>  n   libxl: better name for last parameter of libxl_list_vcpu
>  n   libxl: fix memory leak in libxl_list_vcpu
>  a   libxc/libxl: sanitize error handling in *_get_max_{cpus,nodes}
>      libxc/libxl: allow to retrieve the number of online pCPUs
>  ra  xl: allow for node-wise specification of vcpu pinning
>  a   xl: implement and enable dryrun mode for `xl vcpu-pin'
>  a   xl: test script for the cpumap parser (for vCPU pinning)

It is getting very late into the code freeze now. Most of these look
like good improvements in their own right, and it would be good to get
more testing of them -- could we check these in sooner rather than
later? I think the ones which are reviewed / acked are fairly
independent of the ones that are not.

The node-wise specification for pinning is one in particular which I
would have been much happier to check in two weeks ago, when it first
got IanJ's ack...

 -George
George Dunlap
2013-Dec-03 18:21 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On 12/03/2013 10:06 AM, Jan Beulich wrote:
>>>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> +        goto setvcpuaffinity_out;
>>> +
>>> +        /*
>>> +         * We both set a new affinity and report back to the caller what
>>> +         * the scheduler will be effectively using.
>>> +         */
>>> +        if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
>>> +        {
>>> +            ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
>>> +                                          &vcpuaff->cpumap_hard,
>>> +                                          vcpuaff->cpumap_hard.nr_bits);
>>
>> There's no code above range checking vcpuaff->cpumap_hard.nr_bits,
>> yet xenctl_bitmap_to_bitmap() uses the passed in value to write into
>> the array pointed to by the first argument. Why is this not
>> xenctl_bitmap_to_cpumask() in the first place?
>
> And just to make it explicit - with fundamental flaws like this, I'm
> not certain anymore whether we really ought to rush this series
> in for 4.4.

I'm certainly getting nervous about the prospect. However, the above
bug would only be triggered by bad input from domain 0, right? I
suppose even that would be a potential security issue in a highly
disaggregated environment.

Other bugs in this patch would be similar. This path is taken on domain
creation IIUC; so bugs in this particular patch would probably either
be unexpected behavior of the affinities, or failure to handle unusual
input from a trusted source (domain 0).

 -George
Dario Faggioli
2013-Dec-03 18:29 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 18:21 +0000, George Dunlap wrote:
> On 12/03/2013 10:06 AM, Jan Beulich wrote:
> >>>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote:
> >>>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> >>> +        goto setvcpuaffinity_out;
> >>> +
> >>> +        /*
> >>> +         * We both set a new affinity and report back to the caller what
> >>> +         * the scheduler will be effectively using.
> >>> +         */
> >>> +        if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
> >>> +        {
> >>> +            ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
> >>> +                                          &vcpuaff->cpumap_hard,
> >>> +                                          vcpuaff->cpumap_hard.nr_bits);
> >>
> >> There's no code above range checking vcpuaff->cpumap_hard.nr_bits,
> >> yet xenctl_bitmap_to_bitmap() uses the passed in value to write into
> >> the array pointed to by the first argument. Why is this not
> >> xenctl_bitmap_to_cpumask() in the first place?
> >
> > And just to make it explicit - with fundamental flaws like this, I'm
> > not certain anymore whether we really ought to rush this series
> > in for 4.4.
>
> I'm certainly getting nervous about the prospect.
>
Sorry. :-/

> However, the above
> bug would only be triggered by bad input from domain 0, right?
>
Exactly, either at domain creation or `xl vcpu-pin' time.

> I suppose
> even that would be a potential security issue in a highly disaggregated
> environment.
>
And it's fixed here (again, just waiting to gather a bit more
feedback/acks before reposting):

git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v6
http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v6

> Other bugs in this patch would be similar. This path is taken on domain
> creation IIUC; so bugs in this particular patch would probably either be
> unexpected behavior of the affinities, or failure to handle unusual
> input from a trusted source (domain 0).
>
Indeed.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2013-Dec-03 18:37 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On 12/03/2013 06:29 PM, Dario Faggioli wrote:
> On mar, 2013-12-03 at 18:21 +0000, George Dunlap wrote:
>> On 12/03/2013 10:06 AM, Jan Beulich wrote:
>>>>>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>>>> +        goto setvcpuaffinity_out;
>>>>> +
>>>>> +        /*
>>>>> +         * We both set a new affinity and report back to the caller what
>>>>> +         * the scheduler will be effectively using.
>>>>> +         */
>>>>> +        if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
>>>>> +        {
>>>>> +            ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
>>>>> +                                          &vcpuaff->cpumap_hard,
>>>>> +                                          vcpuaff->cpumap_hard.nr_bits);
>>>>
>>>> There's no code above range checking vcpuaff->cpumap_hard.nr_bits,
>>>> yet xenctl_bitmap_to_bitmap() uses the passed in value to write into
>>>> the array pointed to by the first argument. Why is this not
>>>> xenctl_bitmap_to_cpumask() in the first place?
>>>
>>> And just to make it explicit - with fundamental flaws like this, I'm
>>> not certain anymore whether we really ought to rush this series
>>> in for 4.4.
>>
>> I'm certainly getting nervous about the prospect.
>>
> Sorry. :-/

Nothing to be sorry about, Dario -- this is software development. Not
predictable, not really going to try. :-) The core functionality was
simple and easy and finished months ago; what we've been working out is
the interface into that functionality, which always takes a lot longer
than you think it possibly could. (My USB hotplug series didn't make it
in for 4.3 for similar reasons.)

It is worth looking at the whole series again to try to see what the
risks are, and if it's still worth taking. I'll probably send something
out tomorrow.

 -George
Dario Faggioli
2013-Dec-03 19:06 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 18:37 +0000, George Dunlap wrote:
> On 12/03/2013 06:29 PM, Dario Faggioli wrote:
> > On mar, 2013-12-03 at 18:21 +0000, George Dunlap wrote:
> >> I'm certainly getting nervous about the prospect.
> >>
> > Sorry. :-/
>
> Nothing to be sorry about, Dario -- this is software development. Not
> predictable, not really going to try. :-) The core functionality was
> simple and easy and finished months ago; what we've been working out is
> the interface into that functionality, which always takes a lot longer
> than you think it possibly could. (My USB hotplug series didn't make it
> in for 4.3 for similar reasons.)
>
I know, I know... And yet, sometimes it happens that you're sorry even
if there are no real reasons to be. :-P

> It is worth looking at the whole series again to try to see what the
> risks are, and if it's still worth taking. I'll probably send something
> out tomorrow.
>
Right. Since you pronounced yourself in favour of the exception fairly
early, I never included such an analysis in later postings. I think
it's on me to provide it, so I will do that (tomorrow too, so feel free
to wait for mine, if you want).

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Dec-04 09:03 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mar, 2013-12-03 at 20:06 +0100, Dario Faggioli wrote:
> On mar, 2013-12-03 at 18:37 +0000, George Dunlap wrote:
> > It is worth looking at the whole series again to try to see what the
> > risks are, and if it's still worth taking. I'll probably send something
> > out tomorrow.
> >
> Right. Since you pronounced yourself in favour of the exception fairly
> early, I never included such an analysis in later postings. I think
> it's on me to provide it, so I will do that (tomorrow too, so feel free
> to wait for mine, if you want).
>
So, here is the risk-vs-benefit analysis.

If this still was only per-vCPU NUMA affinity, as it started, there
would be no point in having it: no in-tree consumers, unlikely to be
noticed and used by actual users. However, the way we redesigned it
makes it general enough to be interesting even independently of NUMA.
It now is quite an advanced feature which, as far as I know, not many
other OSes or hypervisors have, and it comes at a very reasonable cost,
in terms of amount of code and magnitude of infrastructural (in the
scheduling subsystem) changes. Actually, the latter is a simplification
wrt what we have now!

Granted that it provides benefits, what about the risks? I think there
are two kinds: one is related to bugs (of course), the other has to do
with the interface.

Bug wise, I tend to agree with what George said in his last e-mail.
Most of the hypervisor work which happens in 'common code' (i.e., will
affect people not using this feature) is just refactoring and rewiring
various bits and pieces. Most of the new code is in enabling the
feature at the Xen, libxc and libxl level, and bugs there, for one,
shouldn't be too hard to spot and fix (as happened right during v5),
and could only be triggered from dom0 (domU creation, or via the new
toolstack command being introduced).

I think the (potential) interface issues are the more important ones.
In fact, the interface not being the optimal one (at the Xen and xc
level) and not getting much attention (at the libxl level) in the first
versions of the patch series is what brought us here, this late into
code freeze. My personal opinion is that we have finally reached a
point where the interface is consistent and easy to maintain and to
extend in a compatible way, where that is needed (see, for instance,
the conversation with Ian Campbell about xl options).

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2013-Dec-04 11:40 UTC
Re: [PATCH v5 02/17] libxl: better name for last parameter of libxl_list_vcpu
Dario Faggioli writes ("[PATCH v5 02/17] libxl: better name for last parameter of libxl_list_vcpu"):
> as it is now called `nr_vcpus_out', but what is returned there is
> the number of pCPUs, rather than the number of vCPUs (which is
> also returned, but in another parameter, `nb_vcpu').
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Although while you were there you could have renamed "nb_vcpu" to
"nr_vcpus_out"...

Thanks,
Ian.
George Dunlap
2013-Dec-04 15:49 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On 12/04/2013 09:03 AM, Dario Faggioli wrote:
> On mar, 2013-12-03 at 20:06 +0100, Dario Faggioli wrote:
>> On mar, 2013-12-03 at 18:37 +0000, George Dunlap wrote:
>>> It is worth looking at the whole series again to try to see what the
>>> risks are, and if it's still worth taking. I'll probably send something
>>> out tomorrow.
>>>
>> Right. Since you pronounced yourself in favour of the exception fairly
>> early, I never included such an analysis in later postings. I think
>> it's on me to provide it, so I will do that (tomorrow too, so feel free
>> to wait for mine, if you want).
>>
> So, here is the risk-vs-benefit analysis.
>
> If this still was only per-vCPU NUMA affinity, as it started, there
> would be no point in having it: no in-tree consumers, unlikely to be
> noticed and used by actual users. However, the way we redesigned it
> makes it general enough to be interesting even independently of NUMA.
> It now is quite an advanced feature which, as far as I know, not many
> other OSes or hypervisors have, and it comes at a very reasonable cost,
> in terms of amount of code and magnitude of infrastructural (in the
> scheduling subsystem) changes. Actually, the latter is a simplification
> wrt what we have now!
>
> Granted that it provides benefits, what about the risks? I think there
> are two kinds: one is related to bugs (of course), the other has to do
> with the interface.
>
> Bug wise, I tend to agree with what George said in his last e-mail.
> Most of the hypervisor work which happens in 'common code' (i.e., will
> affect people not using this feature) is just refactoring and rewiring
> various bits and pieces. Most of the new code is in enabling the
> feature at the Xen, libxc and libxl level, and bugs there, for one,
> shouldn't be too hard to spot and fix (as happened right during v5),
> and could only be triggered from dom0 (domU creation, or via the new
> toolstack command being introduced).
>
> I think the (potential) interface issues are the more important ones.
> In fact, the interface not being the optimal one (at the Xen and xc
> level) and not getting much attention (at the libxl level) in the first
> versions of the patch series is what brought us here, this late into
> code freeze. My personal opinion is that we have finally reached a
> point where the interface is consistent and easy to maintain and to
> extend in a compatible way, where that is needed (see, for instance,
> the conversation with Ian Campbell about xl options).

I agree with the general description of the situation.

I just went through all of the patches yesterday and tried to break
down the "risk" by evaluating the complexity of each patch, the
potential impact if there were a bug, and the "reviewer risk" (i.e.,
how comfortable the reviewers seemed with the code / interface). (I'll
paste in my notes below for those interested.)

The patches are fairly straightforward, and the risk for most of them
is fairly small and simple. For the core patches, I think it likely
that the worst that could happen, if there is a bug, would be a
performance regression. So from a pure code perspective, I think the
risk is on the low side.

The thing that is more risky is, as you say, the interface. It's still
quite a bit in flux, particularly the command-line part.

And as far as coolness -- while it is definitely a cool feature, I'm
not sure it's so critical that it can't wait for the next release.
If we really want to prioritize a stable, bug-free release, I think
we'll probably have to say 'wait' at this point to the soft affinity
feature.

Other views are welcome. :-)

 -George
Dario Faggioli
2013-Dec-04 16:03 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
On mer, 2013-12-04 at 15:49 +0000, George Dunlap wrote:
> And as far as coolness -- while it is definitely a cool feature, I'm
> not sure it's so critical that it can't wait for the next release.
>
Sure it can wait! I was only pointing that out in order to make it
clear that there would have been benefits to taking the risks, as is
usual in a cost-vs-benefit analysis. :-)

As I said myself, if we still think there are concerns about the
interface, I'm happy to wait.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Jan Beulich
2013-Dec-04 16:20 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
>>> On 04.12.13 at 16:49, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> And as far as coolness -- while it is definitely a cool feature, I'm
> not sure it's so critical that it can't wait for the next release. If
> we really want to prioritize a stable, bug-free release, I think we'll
> probably have to say 'wait' at this point to the soft affinity feature.

+1

Jan
Ian Jackson
2013-Dec-05 12:07 UTC
Re: [PATCH v5 03/17] libxl: fix memory leak in libxl_list_vcpu
Dario Faggioli writes ("[PATCH v5 03/17] libxl: fix memory leak in libxl_list_vcpu"):
> more specifically, of the cpumap inside libxl_vcpuinfo, in case
> of failure after it has been allocated.
>
> While at it, use the correct libxl memory allocation wrapper for
> calloc() in there and turn the function into using the new LOGE()
> logging style.
...
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> index 0cb7ca9..028106b 100644
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -4527,31 +4527,31 @@ const libxl_version_info* libxl_get_version_info(libxl_ctx *ctx)
>  libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
>                                  int *nb_vcpu, int *nr_cpus_out)
>  {
> +    GC_INIT(ctx);
>      libxl_vcpuinfo *ptr, *ret;
>      xc_domaininfo_t domaininfo;
>      xc_vcpuinfo_t vcpuinfo;
>
>      if (xc_domain_getinfolist(ctx->xch, domid, 1, &domaininfo) != 1) {
> -        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting infolist");
> +        LOGE(ERROR, "getting infolist");
>          return NULL;
>      }

Would it be better to initialise ret to NULL and then change this
"return NULL" to "goto err"?

But it's good as far as it goes.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
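The shape being suggested, reduced to a self-contained skeleton (the
helper names and types below are hypothetical stand-ins, not the actual
libxl code; only the control flow is the point):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int dummy; } ctx_t;       /* hypothetical context */
    typedef struct { int dummy; } vcpuinfo_t;  /* hypothetical element */

    static int get_domain_info(ctx_t *c, uint32_t d)
    { (void)c; (void)d; return 1; }
    static vcpuinfo_t *build_vcpu_list(ctx_t *c, uint32_t d, int *nv, int *nc)
    { (void)c; (void)d; *nv = *nc = 0; return calloc(1, sizeof(vcpuinfo_t)); }
    static void free_vcpu_list(vcpuinfo_t *p) { free(p); }  /* NULL-safe */

    vcpuinfo_t *list_vcpus_sketch(ctx_t *ctx, uint32_t domid,
                                  int *nr_vcpus_out, int *nr_cpus_out)
    {
        vcpuinfo_t *ret = NULL;       /* initialised to NULL up front */

        if (get_domain_info(ctx, domid) != 1) {
            fprintf(stderr, "getting infolist\n");
            goto err;                 /* instead of a bare "return NULL" */
        }

        ret = build_vcpu_list(ctx, domid, nr_vcpus_out, nr_cpus_out);
        if (!ret)
            goto err;

        return ret;                   /* success path */

     err:
        free_vcpu_list(ret);          /* single, NULL-safe cleanup point */
        return NULL;
    }

With every failure funnelled through one label, a later allocation (the
cpumap of each entry, in the real function) only needs its cleanup
written once.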
Ian Jackson
2013-Dec-05 12:10 UTC
Re: [PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}
Dario Faggioli writes ("[PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}"):
> In libxc, make xc_get_max_{cpus,nodes}() always return either a
> positive number or -1, and change all the callers to deal with
> that.
>
> In libxl, make libxl_get_max_{cpus,nodes}() always return either a
> positive number or a libxl error code. Thanks to that, it is also
> possible to fix logging for libxl_{cpu,node}_bitmap_alloc(), which
> now happens inside the functions themselves, more accurately
> reporting what happened.
>
> Note that libxl_{cpu,node}_bitmap_alloc() are also moved to
> libxl_utils.c as, apart from having become less trivial and hence
> better suited there, that's required for using GC_INIT()/GC_FREE,
> which in turn is needed for the LOG()/LOGE() macros.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
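From a caller's point of view, the sanitized conventions boil down to
something like the sketch below (illustrative caller code, not part of
the series; the include paths are approximate):

    #include <xenctrl.h>
    #include "libxl_utils.h"

    /* Illustrative consumer of the two sanitized error conventions. */
    static int host_cpu_count(xc_interface *xch, libxl_ctx *ctx)
    {
        /* libxc: a positive count on success, -1 on error. */
        int max_cpus = xc_get_max_cpus(xch);
        if (max_cpus < 0)
            return -1;

        /* libxl: a positive count, or a (negative) libxl error code
         * that can simply be propagated upwards unchanged. */
        int rc = libxl_get_max_cpus(ctx);
        if (rc < 0)
            return rc;

        return max_cpus;
    }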
Dario Faggioli
2013-Dec-06 10:34 UTC
Re: [PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}
On gio, 2013-12-05 at 12:10 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}"):
> > In libxc, make xc_get_max_{cpus,nodes}() always return either a
> > positive number or -1, and change all the callers to deal with
> > that.
> >
> > In libxl, make libxl_get_max_{cpus,nodes}() always return either a
> > positive number or a libxl error code. Thanks to that, it is also
> > possible to fix logging for libxl_{cpu,node}_bitmap_alloc(), which
> > now happens inside the functions themselves, more accurately
> > reporting what happened.
> >
> > Note that libxl_{cpu,node}_bitmap_alloc() are also moved to
> > libxl_utils.c as, apart from having become less trivial and hence
> > better suited there, that's required for using GC_INIT()/GC_FREE,
> > which in turn is needed for the LOG()/LOGE() macros.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
>
So, are these going to be committed, at some point before RC0?

I'm talking about the first half of the series, i.e., patches 1 to 8,
containing _not_ the soft affinity work, but just the fixes for libxc
and libxl and the node-wise specification of vcpu pinning for xl.

Here's the current situation, as far as I can tell:

 a   xl: match output of vcpu-list with pinning syntax
 a   libxl: better name for last parameter of libxl_list_vcpu
 a   libxl: fix memory leak in libxl_list_vcpu
 a   libxc/libxl: sanitize error handling in *_get_max_{cpus,nodes}
 a   libxc/libxl: allow to retrieve the number of online pCPUs
 ra  xl: allow for node-wise specification of vcpu pinning
 a   xl: implement and enable dryrun mode for `xl vcpu-pin'
 a   xl: test script for the cpumap parser (for vCPU pinning)

 r = has been 'Reviewed-by'
 a = has been 'Acked-by'

Should I resend that part of the series with all the Ack-s?

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2013-Dec-06 11:52 UTC
Re: [PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}
Dario Faggioli writes ("Re: [Xen-devel] [PATCH v5 04/17] libxc/libxl: sanitize error handling in *_get_max_{cpus, nodes}"):
> Here's the current situation, as far as I can tell:
>
>  a   xl: match output of vcpu-list with pinning syntax
>  a   libxl: better name for last parameter of libxl_list_vcpu
>  a   libxl: fix memory leak in libxl_list_vcpu
>  a   libxc/libxl: sanitize error handling in *_get_max_{cpus,nodes}
>  a   libxc/libxl: allow to retrieve the number of online pCPUs
>  ra  xl: allow for node-wise specification of vcpu pinning
>  a   xl: implement and enable dryrun mode for `xl vcpu-pin'
>  a   xl: test script for the cpumap parser (for vCPU pinning)
>
>  r = has been 'Reviewed-by'
>  a = has been 'Acked-by'

Thanks for that summary.

> Should I resend that part of the series with all the Ack-s?

Yes, please do. (I have a vague feeling that, the last time I looked at
this, some of the patches didn't apply cleanly to staging, due to other
changes, so you may have to fix up some conflicts.)

Ian.
Dario Faggioli
2013-Dec-06 14:40 UTC
Re: [PATCH v5 02/17] libxl: better name for last parameter of libxl_list_vcpu
On mer, 2013-12-04 at 11:40 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v5 02/17] libxl: better name for last parameter of libxl_list_vcpu"):
> > as it is now called `nr_vcpus_out', but what is returned there is
> > the number of pCPUs, rather than the number of vCPUs (which is
> > also returned, but in another parameter, `nb_vcpu').
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
>
> Although while you were there you could have renamed "nb_vcpu" to
> "nr_vcpus_out"...
>
Ok, while I am resending the first part of the series, I'm going to do
this.

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
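With both renames applied, the prototype would end up along these lines
(a sketch of the intended result, not the committed hunk):

    /* Both out-parameters named consistently: nr_vcpus_out receives
     * the number of vCPUs, nr_cpus_out the number of host pCPUs. */
    libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid,
                                    int *nr_vcpus_out, int *nr_cpus_out);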
Jan Beulich
2013-Dec-11 11:33 UTC
Re: [PATCH v5 12/17] xen/libxc: sched: DOMCTL_*vcpuaffinity works with hard and soft affinity
>>> On 03.12.13 at 19:21, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 12/03/2013 10:06 AM, Jan Beulich wrote:
>>>>> On 03.12.13 at 11:02, "Jan Beulich" <JBeulich@suse.com> wrote:
>>>>>> On 02.12.13 at 19:29, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>>> +        goto setvcpuaffinity_out;
>>>> +
>>>> +        /*
>>>> +         * We both set a new affinity and report back to the caller what
>>>> +         * the scheduler will be effectively using.
>>>> +         */
>>>> +        if ( vcpuaff->flags & XEN_VCPUAFFINITY_HARD )
>>>> +        {
>>>> +            ret = xenctl_bitmap_to_bitmap(cpumask_bits(new_affinity),
>>>> +                                          &vcpuaff->cpumap_hard,
>>>> +                                          vcpuaff->cpumap_hard.nr_bits);
>>>
>>> There's no code above range checking vcpuaff->cpumap_hard.nr_bits,
>>> yet xenctl_bitmap_to_bitmap() uses the passed in value to write into
>>> the array pointed to by the first argument. Why is this not
>>> xenctl_bitmap_to_cpumask() in the first place?
>>
>> And just to make it explicit - with fundamental flaws like this, I'm
>> not certain anymore whether we really ought to rush this series
>> in for 4.4.
>
> I'm certainly getting nervous about the prospect. However, the above
> bug would only be triggered by bad input from domain 0, right? I suppose
> even that would be a potential security issue in a highly disaggregated
> environment.
>
> Other bugs in this patch would be similar. This path is taken on domain
> creation IIUC; so bugs in this particular patch would probably either be
> unexpected behavior of the affinities, or failure to handle unusual
> input from a trusted source (domain 0).

Now that XSA-77 went out, I can properly respond to this: with
disaggregation (as said above), the problem isn't limited to Dom0.
While XSA-77 declares all such issues non-security ones for the time
being, we should strive to avoid introducing new similar issues.

Jan