Dario Faggioli
2013-Nov-05 14:33 UTC
[PATCH RESEND 00/12] Implement per-vcpu NUMA node-affinity for credit1
Hi,

This is basically a resend of http://lists.xen.org/archives/html/xen-devel/2013-10/msg00164.html, as someone requested, for easier review. Nothing has changed, apart from the fact that I removed from the series the patches that have already been applied. The git branch is now this one:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-RESEND

The original cover letter also follows.

Thanks and Regards,
Dario

---

Hi everyone,

This series introduces the concept of per-vcpu NUMA node-affinity. Up to now, node-affinity has only been "per-domain". That means it was the domain that had a node-affinity, and:
 - that node-affinity was used to decide where to allocate the memory for the domain;
 - that node-affinity was used to decide on what nodes _all_ the vcpus of the domain prefer to be scheduled.

After this series, things change as follows:
 - each vcpu of a domain has (well, may have) its own node-affinity, and that is what is used to determine (if the credit1 scheduler is used) where each specific vcpu prefers to run;
 - the node-affinity of the whole domain is the _union_ of all the node-affinities of the domain's vcpus;
 - the memory is still allocated according to the node-affinity of the whole domain (i.e., the union of the vcpu node-affinities, as said above).

In practice, it's not such a big change: I'm just extending to the per-vcpu level what we already had at the domain level. This also makes node-affinity a lot more similar to vcpu-pinning, both in terms of behaviour and user interface. As a side effect, it simplifies the scheduling code (at least the NUMA-aware part) by quite a bit. Finally, and most importantly, this is something that will become really important when we start to support virtual NUMA topologies, as, at that point, having the same node-affinity for all the vcpus in a domain won't be enough any longer (we'll want the vcpus from a particular vnode to have node-affinity with a particular pnode).
More detailed descriptions of the mechanism and of the implementation choices are provided in the changelogs and in the documentation (docs/misc and manpages).

One last thing: this series relies on some other patches and series that I have already sent to xen-devel, but that have not been applied yet. I'm re-sending them here, as part of this series, so feel free to pick them up from here, if you want to apply them, or comment on them in this thread, if you want me to change them. In particular, patches 01 and 03 were already sent as single patches, and patches 04-07 were already sent as a series. Sorry if that is a bit clumsy, but I couldn't find a better way to do it. :-) In the detailed list of patches below, 'x' means previously submitted, '*' means already acked/reviewed-by.

Finally, Elena, this is not super important, but perhaps, in the next release of your vNUMA series, you could try to integrate it with this (and, of course, ask if you need anything while trying to do that). Matt, if/when you eventually release, even as an RFC or something like that, your HVM vNUMA series, we can try to figure out how to integrate that with this, so as to use node-affinity instead of pinning.

The series is also available at the following git coordinates:

 git://xenbits.xen.org/people/dariof/xen.git numa/per-vcpu-affinity-v1
 http://xenbits.xen.org/gitweb/?p=people/dariof/xen.git;a=shortlog;h=refs/heads/numa/per-vcpu-affinity-v1

Let me know what you think about all this.
Regards,
Dario

---

Dario Faggioli (12):
 x    xen: numa-sched: leave node-affinity alone if not in "auto" mode
      xl: allow for node-wise specification of vcpu pinning
 x    xl: implement and enable dryrun mode for `xl vcpu-pin'
 x    xl: test script for the cpumap parser (for vCPU pinning)
      xen: numa-sched: make space for per-vcpu node-affinity
      xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
      xen: numa-sched: use per-vcpu node-affinity for actual scheduling
      xen: numa-sched: enable getting/specifying per-vcpu node-affinity
      libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
      libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
      xl: numa-sched: enable getting/specifying per-vcpu node-affinity
      xl: numa-sched: enable specifying node-affinity in VM config file

 docs/man/xl.cfg.pod.5                           |   90 ++++-
 docs/man/xl.pod.1                               |   25 +
 docs/misc/xl-numa-placement.markdown            |  124 ++++--
 tools/libxc/xc_domain.c                         |   90 ++++-
 tools/libxc/xenctrl.h                           |   19 +
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 +++
 tools/libxl/libxl.c                             |   28 +
 tools/libxl/libxl.h                             |   11 +
 tools/libxl/libxl_dom.c                         |   18 +
 tools/libxl/libxl_numa.c                        |   14 -
 tools/libxl/libxl_types.idl                     |    1
 tools/libxl/libxl_utils.h                       |   12 +
 tools/libxl/xl.h                                |    1
 tools/libxl/xl_cmdimpl.c                        |  458 +++++++++++++++++++----
 tools/libxl/xl_cmdtable.c                       |   11 -
 xen/common/domain.c                             |   97 ++---
 xen/common/domctl.c                             |   47 ++
 xen/common/keyhandler.c                         |    6
 xen/common/sched_credit.c                       |   63 ---
 xen/common/schedule.c                           |   55 +++
 xen/include/public/domctl.h                     |    8
 xen/include/xen/sched-if.h                      |    2
 xen/include/xen/sched.h                         |   13 +
 xen/xsm/flask/hooks.c                           |    2
 25 files changed, 1254 insertions(+), 288 deletions(-)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode
If the domain's NUMA node-affinity is being specified by the user/toolstack (instead of being automatically computed by Xen), we really should stick to that. This means domain_update_node_affinity() is wrong when it filters out some stuff from there even in "!auto" mode. This commit fixes that. Of course, this does not mean node-affinity is always honoured (e.g., a vcpu won't run on a pcpu of a different cpupool), but the necessary logic for taking all the possible situations into account lives in the scheduler code, where it belongs.

What could happen without this change is that, under certain circumstances, the node-affinity of a domain may change when the user modifies the vcpu-affinity of the domain's vcpus. This, even if probably not a real bug, is at least something the user does not expect, so let's avoid it.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
---
This has been submitted already as a single patch on its own. Since this series needs the change done here, it is included here too, instead of pinging the original submission and deferring posting this series.
---
 xen/common/domain.c |   28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/xen/common/domain.c b/xen/common/domain.c
index 5999779..af31ab4 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d)
     cpumask_var_t cpumask;
     cpumask_var_t online_affinity;
     const cpumask_t *online;
-    nodemask_t nodemask = NODE_MASK_NONE;
     struct vcpu *v;
     unsigned int node;
@@ -374,28 +373,19 @@ void domain_update_node_affinity(struct domain *d)
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
+    /*
+     * If d->auto_node_affinity is true, the domain's node-affinity mask
+     * (d->node_affinity) is automatically computed from all the domain's
+     * vcpus' vcpu-affinity masks (the union of which we have just built
+     * above in cpumask). OTOH, if d->auto_node_affinity is false, we
+     * must leave the node-affinity of the domain alone.
+     */
     if ( d->auto_node_affinity )
     {
-        /* Node-affinity is automaically computed from all vcpu-affinities */
+        nodes_clear(d->node_affinity);
         for_each_online_node ( node )
             if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-                node_set(node, nodemask);
-
-        d->node_affinity = nodemask;
-    }
-    else
-    {
-        /* Node-affinity is provided by someone else, just filter out cpus
-         * that are either offline or not in the affinity of any vcpus. */
-        nodemask = d->node_affinity;
-        for_each_node_mask ( node, d->node_affinity )
-            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
-                node_clear(node, nodemask);//d->node_affinity);
-
-        /* Avoid loosing track of node-affinity because of a bad
-         * vcpu-affinity has been specified. */
-        if ( !nodes_empty(nodemask) )
-            d->node_affinity = nodemask;
+                node_set(node, d->node_affinity);
     }
 
     sched_set_node_affinity(d, &d->node_affinity);
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Making it possible to use something like the following:
 * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
 * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
 * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2 but not pCPU 6;
 * ...
in both the domain config file and `xl vcpu-pin'.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes from v2:
 * turned a 'return' into 'goto out', consistently with most of the exit patterns;
 * harmonized error handling: parse_range() now returns a libxl error code, as requested during review;
 * dealing with "all" moved inside update_cpumap_range(). It's tricky to move it into parse_range() (as requested during review), since we need the cpumap being modified handy when dealing with it. However, having it in update_cpumap_range() simplifies the code just as much;
 * explicitly checking for junk after a valid value or range in parse_range(), as requested during review;
 * xl exits on parsing failures, so there is no need to reset the cpumap to something sensible in vcpupin_parse(), as suggested during review.

Changes from v1:
 * code rearranged in order to look simpler to follow and understand, as requested during review;
 * improved docs in xl.cfg.pod.5, as requested during review;
 * strtoul() now returns into an unsigned long, and the case where it returns ULONG_MAX is now taken into account, as requested during review;
 * stuff like "all,^7" now works, as requested during review. Specifying just "^7" does not work either before or after this change;
 * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's')`) by introducing the STR_SKIP_PREFIX() macro, as requested during review.
---
 docs/man/xl.cfg.pod.5    |   20 ++++++
 tools/libxl/xl_cmdimpl.c |  145 ++++++++++++++++++++++++++++++++--------------
 2 files changed, 121 insertions(+), 44 deletions(-)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index d2d8921..1c98cb4 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
 
 =item "0-3,5,^1"
 
-To allow all the vcpus of the guest to run on cpus 0,2,3,5.
+To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
+this with "all" is possible, meaning "all,^7" results in all the vcpus
+of the guest running on all the cpus on the host except cpu 7.
+
+=item "nodes:0-3,node:^2"
+
+To allow all the vcpus of the guest to run on the cpus from NUMA nodes
+0,1,3 of the host. So, if cpus 0-3 belong to node 0, cpus 4-7 belong
+to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus
+of the guest will run on cpus 0-3,8-11.
+
+Combining this notation with the one above is possible. For instance,
+"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
+on all the cpus of NUMA node 2, but not on cpu 6. Following the same
+example as above, that would be cpus 1,4,5,7.
+
+Combining this with "all" is also possible, meaning "all,^nodes:1"
+results in all the vcpus of the guest running on all the cpus on the
+host, except for the cpus belonging to the host NUMA node 1.
 
 =item ["2", "3"] (or [2, 3])
 
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 40feb7d..b8755b9 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -59,6 +59,11 @@
     } \
 })
 
+#define STR_HAS_PREFIX( a, b ) \
+    ( strncmp(a, b, strlen(b)) == 0 )
+#define STR_SKIP_PREFIX( a, b ) \
+    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
+
 int logfile = 2;
 
@@ -513,61 +518,115 @@ static void split_string_into_string_list(const char *str,
     free(s);
 }
 
-static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+static int parse_range(const char *str, unsigned long *a, unsigned long *b)
+{
+    char *nstr, *endptr;
+
+    *a = *b = strtoul(str, &endptr, 10);
+    if (endptr == str || *a == ULONG_MAX)
+        return ERROR_INVAL;
+
+    if (*endptr == '-') {
+        nstr = endptr + 1;
+
+        *b = strtoul(nstr, &endptr, 10);
+        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
+            return ERROR_INVAL;
+    }
+
+    /* Valid value or range so far, but we also don't want junk after that */
+    if (*endptr != '\0')
+        return ERROR_INVAL;
+
+    return 0;
+}
+
+/*
+ * Adds or removes a specific set of cpus (specified in str, either as
+ * single cpus or as entire NUMA nodes) to/from cpumap.
+ */
+static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
 {
-    libxl_bitmap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
-    char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
+    unsigned long ida, idb;
+    libxl_bitmap node_cpumap;
+    bool is_not = false, is_nodes = false;
+    int rc = 0;
+
+    libxl_bitmap_init(&node_cpumap);
 
-    if (!strcmp(cpu, "all")) {
+    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
+    if (rc) {
+        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
+        goto out;
+    }
+
+    /* Are we adding or removing cpus/nodes? */
+    if (STR_SKIP_PREFIX(str, "^")) {
+        is_not = true;
+    }
+
+    /* Are we dealing with cpus or full nodes? */
+    if (STR_SKIP_PREFIX(str, "nodes:")) {
+        is_nodes = true;
+    }
+
+    if (STR_HAS_PREFIX(str, "all")) {
         libxl_bitmap_set_any(cpumap);
-        return 0;
+        goto out;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &exclude_cpumap, 0)) {
-        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
-        return ENOMEM;
+    rc = parse_range(str, &ida, &idb);
+    if (rc) {
+        fprintf(stderr, "Invalid pcpu range: %s.\n", str);
+        goto out;
     }
 
-    for (toka = strtok_r(cpu, ",", &saveptr); toka;
-         toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
-        if (*toka == '^') {
-            /* This (These) Cpu(s) will be removed from the map */
-            toka++;
-            rmcpu = 1;
-        }
-        /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
-        if (endptr == toka) {
-            fprintf(stderr, "Error: Invalid argument.\n");
-            rc = EINVAL;
-            goto vcpp_out;
-        }
-        if (*endptr == '-') {
-            tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
-                fprintf(stderr, "Error: Invalid argument.\n");
-                rc = EINVAL;
-                goto vcpp_out;
+    /* Add or remove the specified cpus in the range */
+    while (ida <= idb) {
+        if (is_nodes) {
+            /* Add/Remove all the cpus of a NUMA node */
+            int i;
+
+            rc = libxl_node_to_cpumap(ctx, ida, &node_cpumap);
+            if (rc) {
+                fprintf(stderr, "libxl_node_to_cpumap failed.\n");
+                goto out;
             }
+
+            /* Add/Remove all the cpus in the node cpumap */
+            libxl_for_each_set_bit(i, node_cpumap) {
+                is_not ? libxl_bitmap_reset(cpumap, i) :
+                         libxl_bitmap_set(cpumap, i);
+            }
+        } else {
+            /* Add/Remove this cpu */
+            is_not ? libxl_bitmap_reset(cpumap, ida) :
+                     libxl_bitmap_set(cpumap, ida);
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_bitmap_set(cpumap, cpuida) :
-                         libxl_bitmap_set(&exclude_cpumap, cpuida);
-            cpuida++;
-        }
+        ida++;
     }
 
-    /* Clear all the cpus from the removal list */
-    libxl_for_each_set_bit(i, exclude_cpumap) {
-        libxl_bitmap_reset(cpumap, i);
-    }
+ out:
+    libxl_bitmap_dispose(&node_cpumap);
+    return rc;
+}
+
+/*
+ * Takes a string representing a set of cpus (specified either as
+ * single cpus or as entire NUMA nodes) and turns it into the
+ * corresponding libxl_bitmap (in cpumap).
+ */
+static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
+{
+    char *ptr, *saveptr = NULL;
+    int rc = 0;
 
-vcpp_out:
-    libxl_bitmap_dispose(&exclude_cpumap);
+    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
+         ptr = strtok_r(NULL, ",", &saveptr)) {
+        rc = update_cpumap_range(ptr, cpumap);
+        if (rc)
+            break;
+    }
 
     return rc;
 }
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 03/12] xl: implement and enable dryrun mode for `xl vcpu-pin'
As it can be useful to see whether the outcome of some complex vCPU pinning bitmap specification looks as expected. This also allows for the introduction of some automatic testing and verification for the bitmap parsing code, as happens already in check-xl-disk-parse and check-xl-vif-parse.

In particular, to make the above possible, this commit also changes the implementation of the vcpu-pin command so that, instead of always returning 0, it returns an error if the parsing fails.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes since v2:
 * fixed a typo in the changelog
---
 tools/libxl/xl_cmdimpl.c  |   49 +++++++++++++++++++++++++++++++------------
 tools/libxl/xl_cmdtable.c |    2 +-
 2 files changed, 37 insertions(+), 14 deletions(-)

diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index b8755b9..f6239d5 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -4572,30 +4572,53 @@ int main_vcpulist(int argc, char **argv)
     return 0;
 }
 
-static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
+static int vcpupin(uint32_t domid, const char *vcpu, char *cpu)
 {
     libxl_vcpuinfo *vcpuinfo;
     libxl_bitmap cpumap;
     uint32_t vcpuid;
     char *endptr;
-    int i, nb_vcpu;
+    int i = 0, nb_vcpu, rc = -1;
+
+    libxl_bitmap_init(&cpumap);
 
     vcpuid = strtoul(vcpu, &endptr, 10);
     if (vcpu == endptr) {
         if (strcmp(vcpu, "all")) {
             fprintf(stderr, "Error: Invalid argument.\n");
-            return;
+            goto out;
         }
         vcpuid = -1;
     }
 
-    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) {
-        goto vcpupin_out;
-    }
+    if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0))
+        goto out;
 
     if (vcpupin_parse(cpu, &cpumap))
-        goto vcpupin_out1;
+        goto out;
+
+    if (dryrun_only) {
+        libxl_cputopology *info = libxl_get_cpu_topology(ctx, &i);
+
+        if (!info) {
+            fprintf(stderr, "libxl_get_cpu_topology failed.\n");
+            goto out;
+        }
+        libxl_cputopology_list_free(info, i);
+
+        fprintf(stdout, "cpumap: ");
+        print_cpumap(cpumap.map, i, stdout);
+        fprintf(stdout, "\n");
+
+        if (ferror(stdout) || fflush(stdout)) {
+            perror("stdout");
+            exit(-1);
+        }
+
+        rc = 0;
+        goto out;
+    }
 
     if (vcpuid != -1) {
         if (libxl_set_vcpuaffinity(ctx, domid, vcpuid, &cpumap) == -1) {
@@ -4605,7 +4628,7 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
     else {
         if (!(vcpuinfo = libxl_list_vcpu(ctx, domid, &nb_vcpu, &i))) {
             fprintf(stderr, "libxl_list_vcpu failed.\n");
-            goto vcpupin_out1;
+            goto out;
         }
         for (i = 0; i < nb_vcpu; i++) {
             if (libxl_set_vcpuaffinity(ctx, domid, vcpuinfo[i].vcpuid,
@@ -4616,10 +4639,11 @@ static void vcpupin(uint32_t domid, const char *vcpu, char *cpu)
         }
         libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu);
     }
-  vcpupin_out1:
+
+    rc = 0;
+ out:
     libxl_bitmap_dispose(&cpumap);
-  vcpupin_out:
-    ;
+    return rc;
 }
 
 int main_vcpupin(int argc, char **argv)
@@ -4630,8 +4654,7 @@ int main_vcpupin(int argc, char **argv)
         /* No options */
     }
 
-    vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
-    return 0;
+    return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]);
 }
 
 static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host)
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
index 326a660..d3dcbf0 100644
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -211,7 +211,7 @@ struct cmd_spec cmd_table[] = {
       "[Domain, ...]",
     },
     { "vcpu-pin",
-      &main_vcpupin, 0, 1,
+      &main_vcpupin, 1, 1,
       "Set which CPUs a VCPU can use",
       "<Domain> <VCPU|all> <CPUs|all>",
     },
Dario Faggioli
2013-Nov-05 14:34 UTC
[PATCH RESEND 04/12] xl: test script for the cpumap parser (for vCPU pinning)
This commit introduces "check-xl-vcpupin-parse", for helping verify and debug the (v)CPU bitmap parsing code in xl. The script runs `xl -N vcpu-pin 0 all <some strings>` repeatedly, with various input strings, and checks that the output is as expected.

This is what the script can do:

 # ./check-xl-vcpupin-parse -h
 usage: ./check-xl-vcpupin-parse [options]

 Tests various vcpu-pinning strings. If run without arguments acts
 as follows:
  - generates some test data and saves them in
    check-xl-vcpupin-parse.data;
  - tests all the generated configurations (reading them back from
    check-xl-vcpupin-parse.data).

 An example of a test vector file is provided in
 check-xl-vcpupin-parse.data-example.

 Options:
  -h         prints this message
  -r seed    uses seed for initializing the random number generator
             (default: the script PID)
  -s string  tries using string as a vcpu pinning configuration and
             reports whether that succeeds or not
  -o ofile   save the test data in ofile
             (default: check-xl-vcpupin-parse.data)
  -i ifile   read test data from ifile

An example test data file (generated on a host with 2 NUMA nodes and 16 CPUs) is provided in check-xl-vcpupin-parse.data-example.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Picking this up from a previously submitted series ("xl: allow for node-wise specification of vcpu pinning"), as the changes in that series and in this one would otherwise conflict. If this is considered fine, feel free to apply it from here and skip the corresponding e-mail in the original submission.
---
Changes from v2:
 * killed the `sleep 1', as requested during review;
 * allow for passing a custom random seed, and report the actual random seed used, as requested during review;
 * allow for testing specific pinning configuration strings, as suggested during review;
 * store the test data in a file, after generating them, and read them back from there for actual testing, as suggested during review;
 * allow for reading the test data from an existing test file instead of always generating new ones.

Changes from v1:
 * this was not there in v1; adding it was requested during review.
---
 tools/libxl/check-xl-vcpupin-parse              |  294 +++++++++++++++++++++++
 tools/libxl/check-xl-vcpupin-parse.data-example |   53 ++++
 2 files changed, 347 insertions(+)
 create mode 100755 tools/libxl/check-xl-vcpupin-parse
 create mode 100644 tools/libxl/check-xl-vcpupin-parse.data-example

diff --git a/tools/libxl/check-xl-vcpupin-parse b/tools/libxl/check-xl-vcpupin-parse
new file mode 100755
index 0000000..6189660
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse
@@ -0,0 +1,294 @@
+#!/bin/bash
+
+set -e
+
+if [ -x ./xl ] ; then
+    export LD_LIBRARY_PATH=.:../libxc:../xenstore:
+    XL=./xl
+else
+    XL=xl
+fi
+
+fprefix=tmp.check-xl-vcpupin-parse
+outfile=check-xl-vcpupin-parse.data
+
+usage () {
+cat <<END
+usage: $0 [options]
+
+Tests various vcpu-pinning strings. If run without arguments acts
+as follows:
+ - generates some test data and saves them in $outfile;
+ - tests all the generated configurations (reading them back from
+   $outfile).
+
+An example of a test vector file is provided in ${outfile}-example.
+
+Options:
+ -h         prints this message
+ -r seed    uses seed for initializing the random number generator
+            (default: the script PID)
+ -s string  tries using string as a vcpu pinning configuration and
+            reports whether that succeeds or not
+ -o ofile   save the test data in ofile (default: $outfile)
+ -i ifile   read test data from ifile
+END
+}
+
+expected () {
+    cat >$fprefix.expected
+}
+
+# by default, re-seed with our PID
+seed=$$
+failures=0
+
+# Execute one test and check the result against the provided
+# rc value and output
+one () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" </dev/null >$fprefix.actual 2>/dev/null
+    actual_rc=$?
+    if [ $actual_rc != $expected_rc ]; then
+        diff -u $fprefix.expected $fprefix.actual
+        echo >&2 "test case \`$*' failed ($actual_rc $diff_rc)"
+        failures=$(( $failures + 1 ))
+    fi
+    set -e
+}
+
+# Write an entry in the test vector file. Format is as follows:
+#  test-string*expected-rc*expected-output
+write () {
+    printf "$1*$2*$3\n" >> $outfile
+}
+
+complete () {
+    if [ "$failures" = 0 ]; then
+        echo all ok.; exit 0
+    else
+        echo "$failures tests failed."; exit 1
+    fi
+}
+
+# Test a specific pinning string
+string () {
+    expected_rc=$1; shift
+    printf "test case %s...\n" "$*"
+    set +e
+    ${XL} -N vcpu-pin 0 all "$@" &> /dev/null
+    actual_rc=$?
+    set -e
+
+    if [ $actual_rc != $expected_rc ]; then
+        echo >&2 "test case \`$*' failed ($actual_rc)"
+    else
+        echo >&2 "test case \`$*' succeeded"
+    fi
+
+    exit 0
+}
+
+# Read a test vector file (provided as $1) line by line and
+# test all the entries it contains
+run () {
+    while read line
+    do
+        if [ ${line:0:1} != '#' ]; then
+            test_string="`echo $line | cut -f1 -d'*'`"
+            exp_rc="`echo $line | cut -f2 -d'*'`"
+            exp_output="`echo $line | cut -f3 -d'*'`"
+
+            expected <<END
+$exp_output
+END
+            one $exp_rc "$test_string"
+        fi
+    done < $1
+
+    complete
+
+    exit 0
+}
+
+while getopts "hr:s:o:i:" option
+do
+    case $option in
+    h)
+        usage
+        exit 0
+        ;;
+    r)
+        seed=$OPTARG
+        ;;
+    s)
+        string 0 "$OPTARG"
+        ;;
+    o)
+        outfile=$OPTARG
+        ;;
+    i)
+        run $OPTARG
+        ;;
+    esac
+done
+
+#---------- test data ----------
+#
+nr_cpus=`xl info | grep nr_cpus | cut -f2 -d':'`
+nr_nodes=`xl info | grep nr_nodes | cut -f2 -d':'`
+nr_cpus_per_node=`xl info -n | sed '/cpu:/,/numa_info/!d' | head -n -1 | \
+    awk '{print $4}' | uniq -c | tail -1 | awk '{print $1}'`
+cat >$outfile <<END
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: $nr_cpus CPUs, $nr_nodes nodes, $nr_cpus_per_node CPUs per node.
+#
+# seed used for random number generation: seed=${seed}.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+END
+
+# Re-seed the random number generator
+RANDOM=$seed
+
+echo "# Testing a wrong configuration" >> $outfile
+write foo 255 ""
+
+echo "# Testing the 'all' syntax" >> $outfile
+write "all" 0 "cpumap: any cpu"
+write "nodes:all" 0 "cpumap: any cpu"
+write "all,nodes:all" 0 "cpumap: any cpu"
+write "all,^nodes:0,all" 0 "cpumap: any cpu"
+
+echo "# Testing the empty cpumap case" >> $outfile
+write "^0" 0 "cpumap: none"
+
+echo "# A few attempts of pinning to just one random cpu" >> $outfile
+if [ $nr_cpus -gt 1 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        write "$cpu" 0 "cpumap: $cpu"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random cpu" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpu=$(($RANDOM % nr_cpus))
+        if [ $cpu -eq 0 ]; then
+            expected_range="1-$((nr_cpus - 1))"
+        elif [ $cpu -eq 1 ]; then
+            expected_range="0,2-$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 2)) ]; then
+            expected_range="0-$((cpu - 1)),$((nr_cpus - 1))"
+        elif [ $cpu -eq $((nr_cpus - 1)) ]; then
+            expected_range="0-$((nr_cpus - 2))"
+        else
+            expected_range="0-$((cpu - 1)),$((cpu + 1))-$((nr_cpus - 1))"
+        fi
+        write "all,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of cpus" >> $outfile
+if [ $nr_cpus -gt 2 ]; then
+    for i in `seq 0 3`; do
+        cpua=$(($RANDOM % nr_cpus))
+        range=$((nr_cpus - cpua))
+        cpub=$(($RANDOM % range))
+        cpubb=$((cpua + cpub))
+        if [ $cpua -eq $cpubb ]; then
+            expected_range="$cpua"
+        else
+            expected_range="$cpua-$cpubb"
+        fi
+        write "$expected_range" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to just one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        write "nodes:$node" 0 "cpumap: $((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1)-1))"
+    done
+fi
+
+echo "# A few attempts of pinning to all but one random node" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $node -eq 0 ]; then
+            expected_range="$nr_cpus_per_node-$((nr_cpus - 1))"
+        elif [ $node -eq $((nr_nodes - 1)) ]; then
+            expected_range="0-$((nr_cpus - nr_cpus_per_node - 1))"
+        else
+            expected_range="0-$((nr_cpus_per_node*node-1)),$((nr_cpus_per_node*(node+1)))-$nr_cpus"
+        fi
+        write "all,^nodes:$node" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a random range of nodes" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        nodea=$(($RANDOM % nr_nodes))
+        range=$((nr_nodes - nodea))
+        nodeb=$(($RANDOM % range))
+        nodebb=$((nodea + nodeb))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        if [ $nodea -eq 0 ] && [ $nodebb -eq $((nr_nodes - 1)) ]; then
+            expected_range="any cpu"
+        else
+            expected_range="$((nr_cpus_per_node*nodea))-$((nr_cpus_per_node*(nodebb+1) - 1))"
+        fi
+        write "nodes:$nodea-$nodebb" 0 "cpumap: $expected_range"
+    done
+fi
+
+echo "# A few attempts of pinning to a node but excluding one random cpu" >> $outfile
+if [ $nr_nodes -gt 1 ]; then
+    for i in `seq 0 3`; do
+        node=$(($RANDOM % nr_nodes))
+        # this assumes that the first $nr_cpus_per_node (from cpu
+        # 0 to cpu $nr_cpus_per_node-1) are assigned to the first node
+        # (node 0), the second $nr_cpus_per_node (from $nr_cpus_per_node
+        # to 2*$nr_cpus_per_node-1) are assigned to the second node (node
+        # 1), etc. Expect failures if that is not the case.
+        cpu=$(($RANDOM % nr_cpus_per_node + nr_cpus_per_node*node))
+        if [ $cpu -eq $((nr_cpus_per_node*node)) ]; then
+            expected_range="$((nr_cpus_per_node*node + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*node + 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node)),$((nr_cpus_per_node*node + 2))-$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 2)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 3)),$((nr_cpus_per_node*(node+1) - 1))"
+        elif [ $cpu -eq $((nr_cpus_per_node*(node+1) - 1)) ]; then
+            expected_range="$((nr_cpus_per_node*node))-$((nr_cpus_per_node*(node+1) - 2))"
+        else
+            expected_range="$((nr_cpus_per_node*node))-$((cpu - 1)),$((cpu + 1))-$((nr_cpus_per_node*(node+1) - 1))"
+        fi
+        write "nodes:$node,^$cpu" 0 "cpumap: $expected_range"
+    done
+fi
+
+run $outfile
diff --git a/tools/libxl/check-xl-vcpupin-parse.data-example b/tools/libxl/check-xl-vcpupin-parse.data-example
new file mode 100644
index 0000000..c9cf660
--- /dev/null
+++ b/tools/libxl/check-xl-vcpupin-parse.data-example
@@ -0,0 +1,53 @@
+# WARNING: some of these tests are topology based tests.
+# Expect failures if the topology is not detected correctly
+# detected topology: 16 CPUs, 2 nodes, 8 CPUs per node.
+#
+# seed used for random number generation: seed=13328.
+#
+# Format is as follows:
+#  test-string*expected-return-code*expected-output
+#
+# Testing a wrong configuration
+foo*255*
+# Testing the 'all' syntax
+all*0*cpumap: any cpu
+nodes:all*0*cpumap: any cpu
+all,nodes:all*0*cpumap: any cpu
+all,^nodes:0,all*0*cpumap: any cpu
+# Testing the empty cpumap case
+^0*0*cpumap: none
+# A few attempts of pinning to just one random cpu
+0*0*cpumap: 0
+9*0*cpumap: 9
+6*0*cpumap: 6
+0*0*cpumap: 0
+# A few attempts of pinning to all but one random cpu
+all,^12*0*cpumap: 0-11,13-15
+all,^6*0*cpumap: 0-5,7-15
+all,^3*0*cpumap: 0-2,4-15
+all,^7*0*cpumap: 0-6,8-15
+# A few attempts of pinning to a random range of cpus
+13-15*0*cpumap: 13-15
+7*0*cpumap: 7
+3-5*0*cpumap: 3-5
+8-11*0*cpumap: 8-11
+# A few attempts of pinning to just one random node
+nodes:1*0*cpumap: 8-15
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+nodes:0*0*cpumap: 0-7
+# A few attempts of pinning to all but one random node
+all,^nodes:0*0*cpumap: 8-15
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:1*0*cpumap: 0-7
+all,^nodes:0*0*cpumap: 8-15
+# A few attempts of pinning to a random range of nodes
+nodes:1-1*0*cpumap: 8-15
+nodes:1-1*0*cpumap: 8-15
+nodes:0-1*0*cpumap: any cpu
+nodes:0-0*0*cpumap: 0-7
+# A few attempts of pinning to a node but excluding one random cpu
+nodes:1,^8*0*cpumap: 9-15
+nodes:0,^6*0*cpumap: 0-5,7
+nodes:1,^9*0*cpumap: 8,10-15
+nodes:0,^5*0*cpumap: 0-4,6-7
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
Before this change, each vcpu had its own vcpu-affinity (also called pinning), while the whole domain had a NUMA node-affinity. Of course, as the (credit) scheduler schedules vcpus and not whole domains, this means that all the vcpus of a domain had the same NUMA node-affinity. This change is the first step toward overcoming this limitation. It adds the data structures for storing the node-affinity on a per-vcpu basis (along with allocating and initializing them). As far as this change alone is concerned, there is no specific way to change the node-affinity of a vcpu to something which is not automatically computed (based on its vcpu-affinity). Such logic is introduced in subsequent commits. Also, now that each vcpu has its own node-affinity, in case the domain's node-affinity is set to 'automatically computed', we build it up as the union of all the node-affinities of all the vcpus of the domain. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 39 ++++++++++++++++++++++++++++++++----- xen/common/keyhandler.c | 6 +++++- xen/common/schedule.c | 50 +++++++++++++++++++++++++++++++++++++++++++++++ xen/include/xen/sched.h | 10 +++++++++ 4 files changed, 99 insertions(+), 6 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index af31ab4..8d2ff49 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -128,6 +128,7 @@ struct vcpu *alloc_vcpu( if ( !zalloc_cpumask_var(&v->cpu_affinity) || !zalloc_cpumask_var(&v->cpu_affinity_tmp) || !zalloc_cpumask_var(&v->cpu_affinity_saved) || + !zalloc_cpumask_var(&v->node_affinity) || !zalloc_cpumask_var(&v->vcpu_dirty_cpumask) ) goto fail_free; @@ -159,6 +160,7 @@ struct vcpu *alloc_vcpu( free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); free_cpumask_var(v->cpu_affinity_saved); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); return NULL; @@ -353,7 +355,7 @@ void
domain_update_node_affinity(struct domain *d) cpumask_var_t online_affinity; const cpumask_t *online; struct vcpu *v; - unsigned int node; + unsigned int cpu; if ( !zalloc_cpumask_var(&cpumask) ) return; @@ -367,9 +369,36 @@ void domain_update_node_affinity(struct domain *d) spin_lock(&d->node_affinity_lock); + /* + * Let's prepare the cpumask that will be used below to actually update + * the node-affinity of the whole domain. Each vcpu has a vcpu-affinity + * and a numa-affinity. What gets built in cpumask (and used below) is + * the union of all the (online) cpus in all the vcpus' numa-affinity + * masks. + * + * In turn, the numa-affinity mask of the i-th vcpu (say, 'v') is + * either derived directly from the vcpu's vcpu-affinity mask (in case + * v->auto_node_affinity is true) or has its own value, (potentially) + * completely independent from v->cpu_affinity. In the former case, it + * is here that we make sure the two affinity masks match (since this + * function gets called upon each modification of v->cpu_affinity made + * in vcpu_set_affinity()); in the latter + * case, we just leave v->node_affinity alone.
+ */ for_each_vcpu ( d, v ) { - cpumask_and(online_affinity, v->cpu_affinity, online); + if ( v->auto_node_affinity ) + { + cpumask_clear(v->node_affinity); + for_each_cpu ( cpu, v->cpu_affinity ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(cpu_to_node(cpu))); + + cpumask_and(online_affinity, v->node_affinity, online); + } + else + cpumask_copy(online_affinity, v->node_affinity); + cpumask_or(cpumask, cpumask, online_affinity); } @@ -383,9 +412,8 @@ void domain_update_node_affinity(struct domain *d) if ( d->auto_node_affinity ) { nodes_clear(d->node_affinity); - for_each_online_node ( node ) - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); } sched_set_node_affinity(d, &d->node_affinity); @@ -734,6 +762,7 @@ static void complete_domain_destroy(struct rcu_head *head) { free_cpumask_var(v->cpu_affinity); free_cpumask_var(v->cpu_affinity_tmp); + free_cpumask_var(v->node_affinity); free_cpumask_var(v->vcpu_dirty_cpumask); free_vcpu_struct(v); } diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c index 8e4b3f8..8d5e8b2 100644 --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -297,7 +297,11 @@ static void dump_domains(unsigned char key) cpuset_print(tmpstr, sizeof(tmpstr), v->vcpu_dirty_cpumask); printk("dirty_cpus=%s ", tmpstr); cpuset_print(tmpstr, sizeof(tmpstr), v->cpu_affinity); - printk("cpu_affinity=%s\n", tmpstr); + printk("cpu_affinity=%s ", tmpstr); + cpuset_print(tmpstr, sizeof(tmpstr), v->node_affinity); + printk("node_affinity=%s%s\n", + v->auto_node_affinity ? 
"(auto)" : "(manual)", + tmpstr); printk(" pause_count=%d pause_flags=%lx\n", atomic_read(&v->pause_count), v->pause_flags); arch_dump_vcpu_info(v); diff --git a/xen/common/schedule.c b/xen/common/schedule.c index 0f45f07..b3966ad 100644 --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -198,6 +198,9 @@ int sched_init_vcpu(struct vcpu *v, unsigned int processor) else cpumask_setall(v->cpu_affinity); + v->auto_node_affinity = 1; + cpumask_copy(v->node_affinity, v->cpu_affinity); + /* Initialise the per-vcpu timers. */ init_timer(&v->periodic_timer, vcpu_periodic_timer_fn, v, v->processor); @@ -684,6 +687,53 @@ int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) return 0; } +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes) +{ + nodemask_t online_nodes; + int node; + + nodes_and(online_nodes, node_online_map, *nodes); + + /* Having no affinity at all is just wrong */ + if ( nodes_empty(online_nodes) ) + return -EINVAL; + + spin_lock(&v->domain->node_affinity_lock); + + /* + * Explicitly saying "all nodes" is not particularly useful here. + * Let''s use it as the `reset numa-affinity to auto'' command. + */ + if ( nodes_full(*nodes) ) + { + v->auto_node_affinity = 1; + goto out; + } + + /* + * When someone asks for a specific numa-affinity for a vcpu we need to + * clear auto_node_affinity, convert the nodemask in online_nodes + * into a cpumask_t and store it in node_affinity. + */ + v->auto_node_affinity = 0; + + cpumask_clear(v->node_affinity); + for_each_node_mask( node, online_nodes ) + cpumask_or(v->node_affinity, v->node_affinity, + &node_to_cpumask(node)); + +out: + spin_unlock(&v->domain->node_affinity_lock); + + /* + * Changing the numa-affinity of a vcpu calls for an update + * of the node-affinity of the whole domain. + */ + domain_update_node_affinity(v->domain); + + return 0; +} + /* Block the currently-executing domain until a pertinent event occurs. 
*/ void vcpu_block(void) { diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 25bf637..732d6b6 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -172,6 +172,8 @@ struct vcpu /* VCPU need affinity restored */ bool_t affinity_broken; + /* is node_affinity (below) automatically computed from vcpu-affinity? */ + bool_t auto_node_affinity; /* * > 0: a single port is being polled; @@ -197,6 +199,13 @@ struct vcpu /* Used to restore affinity across S3. */ cpumask_var_t cpu_affinity_saved; + /* + * Bitmask of CPUs on which this VCPU prefers to run. For both this + * and auto_node_affinity access is serialized against + * v->domain->node_affinity_lock. + */ + cpumask_var_t node_affinity; + /* Bitmask of CPUs which are holding onto this VCPU''s state. */ cpumask_var_t vcpu_dirty_cpumask; @@ -740,6 +749,7 @@ int schedule_cpu_switch(unsigned int cpu, struct cpupool *c); void vcpu_force_reschedule(struct vcpu *v); int cpu_disable_scheduler(unsigned int cpu); int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity); +int vcpu_set_node_affinity(struct vcpu *v, const nodemask_t *nodes); void restore_vcpu_affinity(struct domain *d); void vcpu_runstate_get(struct vcpu *v, struct vcpu_runstate_info *runstate);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 06/12] xen: numa-sched: domain node-affinity always comes from vcpu node-affinity
Now that we have per-vcpu node-affinity, we can do the following: * always consider the domain's node-affinity as 'automatically computed'; * always construct it out of the domain's vcpus' own node-affinities. That means that, if someone wants to change the node-affinity of a domain, it is the node-affinities of all the domain's vcpus that need to be modified. This change modifies domain_set_node_affinity() in such a way that it does exactly that, i.e., it goes through all the domain's vcpus and sets their node-affinity to the specified mask. This means that, "seen from the outside", nothing changes: you call domain_set_node_affinity(), passing a nodemask_t to it, and you get (1) that mask as the node-affinity for the domain (which basically means on what NUMA nodes memory is allocated), and (2) all the vcpus of the domain preferring to run on the pcpus from the nodes in that mask, exactly as it was before this commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- xen/common/domain.c | 48 ++++++++++++++------------------------------- xen/common/sched_credit.c | 3 +-- xen/include/xen/sched.h | 2 -- 3 files changed, 16 insertions(+), 37 deletions(-) diff --git a/xen/common/domain.c b/xen/common/domain.c index 8d2ff49..366d9b9 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -228,7 +228,6 @@ struct domain *domain_create( spin_lock_init(&d->node_affinity_lock); d->node_affinity = NODE_MASK_ALL; - d->auto_node_affinity = 1; spin_lock_init(&d->shutdown_lock); d->shutdown_code = -1; @@ -403,18 +402,13 @@ void domain_update_node_affinity(struct domain *d) } /* - * If d->auto_node_affinity is true, the domain's node-affinity mask - * (d->node_affinity) is automaically computed from all the domain's - * vcpus' vcpu-affinity masks (the union of which we have just built - * above in cpumask). OTOH, if d->auto_node_affinity is false, we - * must leave the node-affinity of the domain alone.
+ * A domain''s node-affinity is just the union of all the domain''s vcpus'' + * numa-affinity masks, which is exactly what we have in cpumask + * (although, we need to convert it from cpumask to nodemask, of course). */ - if ( d->auto_node_affinity ) - { - nodes_clear(d->node_affinity); - for_each_cpu ( cpu, cpumask ) - node_set(cpu_to_node(cpu), d->node_affinity); - } + nodes_clear(d->node_affinity); + for_each_cpu ( cpu, cpumask ) + node_set(cpu_to_node(cpu), d->node_affinity); sched_set_node_affinity(d, &d->node_affinity); @@ -425,33 +419,21 @@ void domain_update_node_affinity(struct domain *d) } +/* Sets the numa-affinity (via vcpu_set_node_affinity() ) for all + * the vcpus of the domain. */ int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) { - /* Being affine with no nodes is just wrong */ - if ( nodes_empty(*affinity) ) - return -EINVAL; - - spin_lock(&d->node_affinity_lock); + struct vcpu *v; + int rc = 0; - /* - * Being/becoming explicitly affine to all nodes is not particularly - * useful. Let''s take it as the `reset node affinity` command. 
- */ - if ( nodes_full(*affinity) ) + for_each_vcpu ( d, v ) { - d->auto_node_affinity = 1; - goto out; + rc = vcpu_set_node_affinity(v, affinity); + if ( rc ) + break; } - d->auto_node_affinity = 0; - d->node_affinity = *affinity; - -out: - spin_unlock(&d->node_affinity_lock); - - domain_update_node_affinity(d); - - return 0; + return rc; } diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index 28dafcf..c53a36b 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -311,8 +311,7 @@ static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const struct domain *d = vc->domain; const struct csched_dom *sdom = CSCHED_DOM(d); - if ( d->auto_node_affinity - || cpumask_full(sdom->node_affinity_cpumask) + if (cpumask_full(sdom->node_affinity_cpumask) || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) return 0; diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index 732d6b6..d8e4735 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -325,8 +325,6 @@ struct domain /* Does this guest need iommu mappings? */ bool_t need_iommu; #endif - /* is node-affinity automatically computed? */ - bool_t auto_node_affinity; /* Is this guest fully privileged (aka dom0)? */ bool_t is_privileged; /* Which guest this guest has privileges on */
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
instead of relying on the domain-wide node-affinity. To achieve that, we use the specific vcpu's node-affinity when computing the NUMA load balancing mask in csched_balance_cpumask(). As a side effect, this simplifies the code quite a bit. In fact, prior to this change, we needed to cache the translation of d->node_affinity (which is a nodemask_t) to a cpumask_t, since that is what is needed during the actual scheduling (we used to keep it in node_affinity_cpumask). Since the per-vcpu node-affinity is maintained in a cpumask_t field (v->node_affinity) already, we don't need that complicated updating logic in place any longer, and this change can hence remove stuff like sched_set_node_affinity(), csched_set_node_affinity() and, of course, node_affinity_cpumask from csched_dom. The high level description of NUMA placement and scheduling in docs/misc/xl-numa-placement.markdown is updated too, to match the new behavior. While at it, an attempt is made to make that document as unambiguous as possible with respect to the concepts of vCPU pinning, domain node-affinity and vCPU node-affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/misc/xl-numa-placement.markdown | 124 +++++++++++++++++++++++----------- xen/common/domain.c | 2 - xen/common/sched_credit.c | 62 ++--------------- xen/common/schedule.c | 5 - xen/include/xen/sched-if.h | 2 - xen/include/xen/sched.h | 1 6 files changed, 94 insertions(+), 102 deletions(-) diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown index caa3fec..890b856 100644 --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -12,13 +12,6 @@ is quite more complex and slow. On these machines, a NUMA node is usually defined as a set of processor cores (typically a physical CPU package) and the memory directly attached to the set of cores.
-The Xen hypervisor deals with NUMA machines by assigning to each domain -a "node affinity", i.e., a set of NUMA nodes of the host from which they -get their memory allocated. Also, even if the node affinity of a domain -is allowed to change on-line, it is very important to "place" the domain -correctly when it is fist created, as the most of its memory is allocated -at that time and can not (for now) be moved easily. - NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable. For more information, have a look at the [Xen NUMA Introduction][numa_intro] page on the Wiki. +## Xen and NUMA machines: the concept of _node-affinity_ ## + +The Xen hypervisor deals with NUMA machines through the concept of +_node-affinity_. When talking about node-affinity, it is important to +distinguish two different situations. The node-affinity *of a domain* is +the set of NUMA nodes of the host where the memory for the domain is +being allocated (mostly, at domain creation time). The node-affinity *of +a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to +run, and the (credit1 only, for now) scheduler will try to accomplish +that, whenever it is possible. + +Of course, despite the fact that they belong to and affect different +subsystems, domain and vCPU node-affinities are related. In fact, the +node-affinity of a domain is the union of the node-affinities of all the +domain's vCPUs. + +The above means that, when changing the vCPU node-affinity, the domain +node-affinity also changes. Although this is allowed to happen +on-line (i.e., when a domain is already running), it will not result +in the memory that has already been allocated being moved to a different +host NUMA node.
This is why it is very important to "place" the domain +correctly when it is first created, as most of its memory is allocated +at that time and cannot (for now) be moved easily. + ### Placing via pinning and cpupools ### The simplest way of placing a domain on a NUMA node is statically pinning -the domain's vCPUs to the pCPUs of the node. This goes under the name of -CPU affinity and can be set through the "cpus=" option in the config file -(more about this below). Another option is to pool together the pCPUs -spanning the node and put the domain in such a cpupool with the "pool=" +the domain's vCPUs to the pCPUs of the node. This also goes under the name +of vCPU-affinity and can be set through the "cpus=" option in the config +file (more about this below). Another option is to pool together the pCPUs +spanning the node and put the domain in such a _cpupool_ with the "pool=" config option (as documented in our [Wiki][cpupools_howto]). In both the above cases, the domain will not be able to execute outside @@ -45,24 +62,45 @@ may come at the cost of some load imbalances. ### NUMA aware scheduling ### -If the credit scheduler is in use, the concept of node affinity defined -above does not only apply to memory. In fact, starting from Xen 4.3, the -scheduler always tries to run the domain's vCPUs on one of the nodes in -its node affinity. Only if that turns out to be impossible, it will just -pick any free pCPU. +If the credit scheduler is in use, and starting from Xen 4.3, the scheduler +always tries to run the domain's vCPUs on one of the nodes in their +node-affinity. Only if that turns out to be impossible will it pick any +free pCPU. +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity, +potentially different from the ones of all the other vCPUs of the domain. -This is, therefore, something more flexible than CPU affinity, as a domain -can still run everywhere, it just prefers some nodes rather than others.
+This is, therefore, something more flexible than vCPU pinning, as vCPUs +can still run everywhere; they just prefer some nodes rather than others. Locality of access is less guaranteed than in the pinning case, but that comes along with better chances to exploit all the host resources (e.g., the pCPUs). -In fact, if all the pCPUs in a domain's node affinity are busy, it is -possible for the domain to run outside of there, but it is very likely that -slower execution (due to remote memory accesses) is still better than no -execution at all, as it would happen with pinning. For this reason, NUMA -aware scheduling has the potential of bringing substantial performances -benefits, although this will depend on the workload. +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is +possible for the vCPU to run outside of there. The idea is that +slower execution (due to remote memory accesses) is still better than +no execution at all (as would happen with pinning). For this reason, +NUMA aware scheduling has the potential of bringing substantial +performance benefits, although this will depend on the workload. + +Notice that, for each vCPU, the following three scenarios are possible: + + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU + node-affinity. In this case, the vCPU is always scheduled on one + of the pCPUs to which it is pinned, without any specific preference + among them. Internally, the vCPU's node-affinity is just + automatically computed from the vCPU pinning, and the scheduler + just ignores it; + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to + any particular pCPU. In this case, the vCPU can run on every pCPU. + Nevertheless, the scheduler will try to have it running on one of + the pCPUs of the node(s) it has node-affinity with; + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to + some pCPUs.
In this case, the vCPU is always scheduled on one of the + pCPUs to which it is pinned, with, among them, a preference for the + ones that are from the node(s) it has node-affinity with. In case + pinning and node-affinity form two disjoint sets of pCPUs, pinning + "wins", and the node-affinity, although it is still used to derive + the domain's node-affinity (for memory allocation), is, from the + scheduler's perspective, just ignored. ## Guest placement in xl ## @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes. Note that xm/xend does a very similar thing, the only differences being the details of the heuristics adopted for automatic placement (see below), -and the lack of support (in both xm/xend and the Xen versions where that\ +and the lack of support (in both xm/xend and the Xen versions where that was the default toolstack) for NUMA aware scheduling. ### Placing the guest manually ### Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled on, directly in its config file. This -affects NUMA placement and memory accesses as the hypervisor constructs -the node affinity of a VM basing right on its CPU affinity when it is -created. +affects NUMA placement and memory accesses as, in this case, the hypervisor +constructs the node-affinity of a VM based directly on its vCPU pinning +when it is created. This is very simple and effective, but requires the user/system -administrator to explicitly specify affinities for each and every domain, +administrator to explicitly specify the pinning of each and every domain, or Xen won't be able to guarantee the locality for their memory accesses. -Notice that this also pins the domain's vCPUs to the specified set of -pCPUs, so it not only sets the domain's node affinity (its memory will -come from the nodes to which the pCPUs belong), but at the same time -forces the vCPUs of the domain to be scheduled on those same pCPUs.
+That, of course, also means the vCPUs of the domain will only be able to +execute on those same pCPUs. ### Placing the guest automatically ### @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. If it finds one (some), the domain's node affinity gets set to there, and both memory allocations and NUMA aware scheduling (for the credit -scheduler and starting from Xen 4.3) will comply with it. +scheduler and starting from Xen 4.3) will comply with it. Starting from +Xen 4.4, this just means all the vCPUs of the domain will have the same +vCPU node-affinity, which is the outcome of such automatic "fitting" +procedure. It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of a host is an incarnation of the Bin Packing Problem. In fact, @@ -143,22 +182,29 @@ any placement from happening: libxl_defbool_set(&domain_build_info->numa_placement, false); Also, if `numa_placement` is set to `true`, the domain must not -have any CPU affinity (i.e., `domain_build_info->cpumap` must -have all its bits set, as it is by default), or domain creation -will fail returning `ERROR_INVAL`. +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have +all its bits set, as it is by default), or domain creation will fail +with an `ERROR_INVAL`. Starting from Xen 4.3, in case automatic placement happens (and is successful), it will affect the domain's node affinity and _not_ its -CPU affinity. Namely, the domain's vCPUs will not be pinned to any +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any pCPU on the host, but the memory from the domain will come from the selected node(s) and the NUMA aware scheduling (if the credit scheduler -is in use) will try to keep the domain there as much as possible. +is in use) will try to keep the domain's vCPUs there as much as possible.
Apart from that, for looking at and/or tweaking the placement algorithm, search "Automatic NUMA placement" in libxl\_internal.h. Note this may change in future versions of Xen/libxl. +## Xen < 4.4 ## + +The concept of per-vCPU node-affinity has been introduced for the first +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is +the same for the whole domain, that is to say the same for all the vCPUs +of the domain. + ## Xen < 4.3 ## As NUMA aware scheduling is a new feature of Xen 4.3, things are a little diff --git a/xen/common/domain.c b/xen/common/domain.c index 366d9b9..ae29945 100644 --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d) for_each_cpu ( cpu, cpumask ) node_set(cpu_to_node(cpu), d->node_affinity); - sched_set_node_affinity(d, &d->node_affinity); - spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c index c53a36b..d228127 100644 --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -178,9 +178,6 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; - /* cpumask translated from the domain's node-affinity. - * Basically, the CPUs we prefer to be scheduled on. */ - cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc) list_del_init(&svc->runq_elem); } -/* - * Translates node-affinity mask into a cpumask, so that we can use it during - * actual scheduling. That of course will contain all the cpus from all the - * set nodes in the original node-affinity mask. - * - * Note that any serialization needed to access mask safely is complete - * responsibility of the caller of this function/hook.
- */ -static void csched_set_node_affinity( - const struct scheduler *ops, - struct domain *d, - nodemask_t *mask) -{ - struct csched_dom *sdom; - int node; - - /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ - if ( unlikely(is_idle_domain(d)) ) - return; - - sdom = CSCHED_DOM(d); - cpumask_clear(sdom->node_affinity_cpumask); - for_each_node_mask( node, *mask ) - cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, - &node_to_cpumask(node)); -} #define for_each_csched_balance_step(step) \ for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ ) @@ -294,12 +265,12 @@ static void csched_set_node_affinity( /* * vcpu-affinity balancing is always necessary and must never be skipped. - * OTOH, if a domain''s node-affinity is said to be automatically computed - * (or if it just spans all the nodes), we can safely avoid dealing with - * node-affinity entirely. + * OTOH, if the vcpu''s numa-affinity is being automatically computed out of + * the vcpu''s vcpu-affinity, or if it just spans all the nodes, we can + * safely avoid dealing with numa-affinity entirely. * - * Node-affinity is also deemed meaningless in case it has empty - * intersection with mask, to cover the cases where using the node-affinity + * A vcpu''s numa-affinity is also deemed meaningless in case it has empty + * intersection with mask, to cover the cases where using the numa-affinity * mask seems legit, but would instead led to trying to schedule the vcpu * on _no_ pcpu! 
Typical use cases are for mask to be equal to the vcpu''s * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus @@ -308,11 +279,9 @@ static void csched_set_node_affinity( static inline int __vcpu_has_node_affinity(const struct vcpu *vc, const cpumask_t *mask) { - const struct domain *d = vc->domain; - const struct csched_dom *sdom = CSCHED_DOM(d); - - if (cpumask_full(sdom->node_affinity_cpumask) - || !cpumask_intersects(sdom->node_affinity_cpumask, mask) ) + if ( vc->auto_node_affinity == 1 + || cpumask_full(vc->node_affinity) + || !cpumask_intersects(vc->node_affinity, mask) ) return 0; return 1; @@ -330,8 +299,7 @@ csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) { if ( step == CSCHED_BALANCE_NODE_AFFINITY ) { - cpumask_and(mask, CSCHED_DOM(vc->domain)->node_affinity_cpumask, - vc->cpu_affinity); + cpumask_and(mask, vc->node_affinity, vc->cpu_affinity); if ( unlikely(cpumask_empty(mask)) ) cpumask_copy(mask, vc->cpu_affinity); @@ -1110,13 +1078,6 @@ csched_alloc_domdata(const struct scheduler *ops, struct domain *dom) if ( sdom == NULL ) return NULL; - if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) - { - xfree(sdom); - return NULL; - } - cpumask_setall(sdom->node_affinity_cpumask); - /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); INIT_LIST_HEAD(&sdom->active_sdom_elem); @@ -1146,9 +1107,6 @@ csched_dom_init(const struct scheduler *ops, struct domain *dom) static void csched_free_domdata(const struct scheduler *ops, void *data) { - struct csched_dom *sdom = data; - - free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1975,8 +1933,6 @@ const struct scheduler sched_credit_def = { .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, - .set_node_affinity = csched_set_node_affinity, - .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c index b3966ad..454f27d 100644 --- a/xen/common/schedule.c +++ 
b/xen/common/schedule.c @@ -648,11 +648,6 @@ int cpu_disable_scheduler(unsigned int cpu) return ret; } -void sched_set_node_affinity(struct domain *d, nodemask_t *mask) -{ - SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); -} - int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) { cpumask_t online_affinity; diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h index d95e254..4164dff 100644 --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -158,8 +158,6 @@ struct scheduler { struct xen_domctl_scheduler_op *); int (*adjust_global) (const struct scheduler *, struct xen_sysctl_scheduler_op *); - void (*set_node_affinity) (const struct scheduler *, - struct domain *, nodemask_t *); void (*dump_settings) (const struct scheduler *); void (*dump_cpu_state) (const struct scheduler *, int); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h index d8e4735..83f89c7 100644 --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -595,7 +595,6 @@ void sched_destroy_domain(struct domain *d); int sched_move_domain(struct domain *d, struct cpupool *c); long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); long sched_adjust_global(struct xen_sysctl_scheduler_op *); -void sched_set_node_affinity(struct domain *, nodemask_t *); int sched_id(void); void sched_tick_suspend(void); void sched_tick_resume(void);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 08/12] xen: numa-sched: enable getting/specifying per-vcpu node-affinity
via two new DOMCTLs: getvcpunodeaffinity and setvcpunodeaffinity. They're very similar to XEN_DOMCTL_{get,set}vcpuaffinity (with the only exception that they take a nodemap instead of a cpumap). Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 8 ++++--- xen/common/domctl.c | 47 ++++++++++++++++++++++++++++++++++++++----- xen/include/public/domctl.h | 8 +++++-- xen/xsm/flask/hooks.c | 2 ++ 4 files changed, 54 insertions(+), 11 deletions(-) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index 2cea6e3..f8825a3 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -219,9 +219,9 @@ int xc_vcpu_setaffinity(xc_interface *xch, memcpy(local, cpumap, cpusize); - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); @@ -260,8 +260,8 @@ int xc_vcpu_getaffinity(xc_interface *xch, domctl.domain = (domid_t)domid; domctl.u.vcpuaffinity.vcpu = vcpu; - set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_bits = cpusize * 8; + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = cpusize * 8; ret = do_domctl(xch, &domctl); diff --git a/xen/common/domctl.c b/xen/common/domctl.c index 870eef1..6bc3f26 100644 --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -584,7 +584,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) break; case XEN_DOMCTL_setvcpuaffinity: - case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: { struct vcpu *v; @@ -600,8 +600,8 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) { cpumask_var_t new_affinity; - ret = xenctl_bitmap_to_cpumask( - &new_affinity, &op->u.vcpuaffinity.cpumap); + ret = xenctl_bitmap_to_cpumask(&new_affinity, +
&op->u.vcpuaffinity.map); if ( !ret ) { ret = vcpu_set_affinity(v, new_affinity); @@ -610,8 +610,45 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xen_domctl_t) u_domctl) } else { - ret = cpumask_to_xenctl_bitmap( - &op->u.vcpuaffinity.cpumap, v->cpu_affinity); + nodemask_t new_affinity; + + ret = xenctl_bitmap_to_nodemask(&new_affinity, + &op->u.vcpuaffinity.map); + if ( !ret ) + ret = vcpu_set_node_affinity(v, &new_affinity); + } + } + break; + + case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: + { + struct vcpu *v; + + ret = -EINVAL; + if ( op->u.vcpuaffinity.vcpu >= d->max_vcpus ) + break; + + ret = -ESRCH; + if ( (v = d->vcpu[op->u.vcpuaffinity.vcpu]) == NULL ) + break; + + if ( op->cmd == XEN_DOMCTL_getvcpuaffinity ) + { + ret = cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + v->cpu_affinity); + } + else + { + nodemask_t affinity; + int cpu; + + nodes_clear(affinity); + for_each_cpu ( cpu, v->node_affinity ) + node_set(cpu_to_node(cpu), affinity); + + ret = nodemask_to_xenctl_bitmap(&op->u.vcpuaffinity.map, + &affinity); } } break; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h index d4e479f..78b6d70 100644 --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -290,12 +290,14 @@ typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); -/* Get/set which physical cpus a vcpu can execute on. */ +/* Get/set which physical cpus a vcpu can or prefer execute on. 
*/ /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ +/* XEN_DOMCTL_setvcpunodeaffinity */ +/* XEN_DOMCTL_getvcpunodeaffinity */ struct xen_domctl_vcpuaffinity { uint32_t vcpu; /* IN */ - struct xenctl_bitmap cpumap; /* IN/OUT */ + struct xenctl_bitmap map; /* IN/OUT */ }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); @@ -932,6 +934,8 @@ struct xen_domctl { #define XEN_DOMCTL_setnodeaffinity 68 #define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_set_max_evtchn 70 +#define XEN_DOMCTL_setvcpunodeaffinity 71 +#define XEN_DOMCTL_getvcpunodeaffinity 72 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c index b1e2593..a9b2401 100644 --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -616,10 +616,12 @@ static int flask_domctl(struct domain *d, int cmd) return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__UNPAUSE); case XEN_DOMCTL_setvcpuaffinity: + case XEN_DOMCTL_setvcpunodeaffinity: case XEN_DOMCTL_setnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__SETAFFINITY); case XEN_DOMCTL_getvcpuaffinity: + case XEN_DOMCTL_getvcpunodeaffinity: case XEN_DOMCTL_getnodeaffinity: return current_has_perm(d, SECCLASS_DOMAIN, DOMAIN__GETAFFINITY);
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new domctls from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxc/xc_domain.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++ tools/libxc/xenctrl.h | 19 +++++++++++ 2 files changed, 101 insertions(+) diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c index f8825a3..e44ad2e 100644 --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -272,6 +272,88 @@ out: return ret; } +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for setvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_setvcpunodeaffinity; + domctl.domain = (domid_t)domid; + domctl.u.vcpuaffinity.vcpu = vcpu; + + memcpy(local, nodemap, nodesize); + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if (local == NULL) + { + PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity; + domctl.domain = (domid_t)domid; + 
domctl.u.vcpuaffinity.vcpu = vcpu; + + set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local); + domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + memcpy(nodemap, local, nodesize); + + xc_hypercall_buffer_free(xch, local); +out: + return ret; +} + int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid, unsigned int *guest_width) { diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h index 8cf3f3b..208fa2c 100644 --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch, uint32_t domind, xc_nodemap_t nodemap); +/** + * These functions set and retrieves the NUMA node-affinity + * of a specific vcpu. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one is interested in. + * @parm vcpu the vcpu one wants to set/get the affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_vcpu_setnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); +int xc_vcpu_getnodeaffinity(xc_interface *xch, + uint32_t domid, + int vcpu, + xc_nodemap_t nodemap); + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
Dario Faggioli
2013-Nov-05 14:35 UTC
[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
by providing the proper get/set interfaces and wiring them to the new libxc calls from the previous commit. For the ''get'' part, the node-affinity of all the vcpus of a domain is also reported via libxl_list_vcpu() (exactly as it is happening for their vcpu affinity already), adding a specific field to libxl_vcpuinfo. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- tools/libxl/libxl.c | 28 ++++++++++++++++++++++++++++ tools/libxl/libxl.h | 11 +++++++++++ tools/libxl/libxl_types.idl | 1 + 3 files changed, 40 insertions(+) diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c index 0f0f56c..4a36e70 100644 --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4208,6 +4208,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating cpumap"); return NULL; } + if (libxl_node_bitmap_alloc(ctx, &ptr->nodemap, 0)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "allocating nodemap"); + return NULL; + } if (xc_vcpu_getinfo(ctx->xch, domid, *nb_vcpu, &vcpuinfo) == -1) { LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu info"); return NULL; @@ -4216,6 +4220,10 @@ libxl_vcpuinfo *libxl_list_vcpu(libxl_ctx *ctx, uint32_t domid, LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu affinity"); return NULL; } + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, *nb_vcpu, ptr->nodemap.map) == -1) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node-affinity"); + return NULL; + } ptr->vcpuid = *nb_vcpu; ptr->cpu = vcpuinfo.cpu; ptr->online = !!vcpuinfo.online; @@ -4251,6 +4259,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, return rc; } +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap) +{ + if (xc_vcpu_setnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, 
uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap) +{ + if (xc_vcpu_getnodeaffinity(ctx->xch, domid, vcpuid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting vcpu node-affinity"); + return ERROR_FAIL; + } + return 0; +} + int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap) { diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h index 1c6675d..d5b8ade 100644 --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -82,6 +82,13 @@ #define LIBXL_HAVE_DOMAIN_NODEAFFINITY 1 /* + * LIBXL_HAVE_VCPUINFO_NODEAFFINITY indicates that a ''nodemap'' field + * (of libxl_bitmap type) is present in libxl_vcpuinfo, containing the + * node-affinity for the vcpu. + */ +#define LIBXL_HAVE_VCPUINFO_NODEAFFINITY + +/* * LIBXL_HAVE_BUILDINFO_HVM_VENDOR_DEVICE indicates that the * libxl_vendor_device field is present in the hvm sections of * libxl_domain_build_info. This field tells libxl which @@ -985,6 +992,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_set_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + const libxl_bitmap *nodemap); +int libxl_get_vcpunodeaffinity(libxl_ctx *ctx, uint32_t domid, uint32_t vcpuid, + libxl_bitmap *nodemap); int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *nodemap); int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl index 5c43d6f..5c83f98 100644 --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -492,6 +492,7 @@ libxl_vcpuinfo = Struct("vcpuinfo", [ ("running", bool), ("vcpu_time", uint64), # total vcpu time ran (ns) ("cpumap", libxl_bitmap), # current cpu''s affinities + ("nodemap", libxl_bitmap), # current node-affinity ], dir=DIR_OUT) libxl_physinfo = 
Struct("physinfo", [
Dario Faggioli
2013-Nov-05 14:36 UTC
[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
by showing it, upon request (''-n'' switch), as a part of the output of `xl vcpu-list''. Such output looks like this: # xl vcpu-list -n Name ID VCPU CPU State Time(s) CPU Affinity / NUMA Affinity Domain-0 0 0 8 -b- 4.4 any cpu / any node Domain-0 0 1 0 -b- 1.2 any cpu / any node vm-test 1 0 8 -b- 2.4 any cpu / 1 vm-test 1 1 13 -b- 3.3 any cpu / 1 The "vcpu-pinning / vcpu-affinity" part may indeed look weird, as it is different from the other fields. Unfortunately, it''s very hard to properly format it in columns, since there is no such things as minimal or maximal lengths for the bitmaps being printed. Specifying a particular node-affinity happens via a new xl command, `xl vcpu-node-affinity''. Introducing a "numa mode", to `xl vcpu-pin'' was the other option, but doing that would probably fuel the confusion between vcpu pinning and NUMA node-affinity, and that is certainly something we do not want. While at it, the implementation of `xl vcpu-list'' is reworked a little bit, making it more similar to the one of `xl list'' and more compliant with the libxl programming paradigm (e.g., regarding allocating at the beginning and freeing on exit). xl manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.pod.1 | 25 +++++ tools/libxl/xl.h | 1 tools/libxl/xl_cmdimpl.c | 210 ++++++++++++++++++++++++++++++++++++--------- tools/libxl/xl_cmdtable.c | 9 ++ 4 files changed, 201 insertions(+), 44 deletions(-) diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1 index e7b9de2..e1894b4 100644 --- a/docs/man/xl.pod.1 +++ b/docs/man/xl.pod.1 @@ -614,11 +614,21 @@ quietly ignored. Some guests may need to actually bring the newly added CPU online after B<vcpu-set>, go to B<SEE ALSO> section for information. -=item B<vcpu-list> [I<domain-id>] +=item B<vcpu-list> [I<OPTIONS>] [I<domain-id>, ...] Lists VCPU information for a specific domain. If no domain is specified, VCPU information for all domains will be provided. 
+B<OPTIONS> + +=over 4 + +=item B<-n>, B<--numa> + +Print the NUMA node-affinity of each VCPU. + +=back + =item B<vcpu-pin> I<domain-id> I<vcpu> I<cpus> Pins the VCPU to only run on the specific CPUs. The keyword @@ -630,6 +640,19 @@ different run state is appropriate. Pinning can be used to restrict this, by ensuring certain VCPUs can only run on certain physical CPUs. +=item B<vcpu-node-affinity> I<domain-id> I<vcpu> I<nodes> + +Specifies the I<nodes> on which the I<vcpu> prefers to run. The +keyword B<all> can be used to set this to I<nodes> for all the +VCPUs in the domain. + +Normally VCPUs can float within the set of CPUs that they are pinned +to (see B<vcpu-pin> above). Specifying a node-affinity does not change +that, but, if using the credit scheduler, the VCPUs will strongly prefer +running on the NUMA nodes they have node-affinity with. If using any +scheduler other than credit, node-affinity is just ignored (credit2 +support is planned but not ready yet). + =item B<vm-list> Prints information about guests. This list excludes information about diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h index e005c39..b708199 100644 --- a/tools/libxl/xl.h +++ b/tools/libxl/xl.h @@ -59,6 +59,7 @@ int main_create(int argc, char **argv); int main_config_update(int argc, char **argv); int main_button_press(int argc, char **argv); int main_vcpupin(int argc, char **argv); +int main_vcpuaff(int argc, char **argv); int main_vcpuset(int argc, char **argv); int main_memmax(int argc, char **argv); int main_memset(int argc, char **argv); diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index f6239d5..1659259 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -542,10 +542,32 @@ static int parse_range(const char *str, unsigned long *a, unsigned long *b) } /* - * Add or removes a specific set of cpus (specified in str, either as - * single cpus or as entire NUMA nodes) to/from cpumap. + * Set or reset some bits from bitmap, as specified in str. 
+ * + * If str * is "all", bitmap gets fully set. If str contains a specific + * value or range (e.g., "2" or "4-6"), the corresponding bits are set (or + * unset, if the value/range is prefixed with a "^"). In this case, we don''t + * really care whether the value range is supposed to represent cpus or nodes + * (that is to say whether bitmap is supposed to be a nodemap or a cpumap), + * we just set and unset bits. + * + * In case str is prefixed by "nodes:" (or "^nodes:") we assume that the + * caller is dealing with a cpumap and wants to specify what bits to (re)set + * NUMA node-wise. We therefore act as follows: + * - consider the value/range a node or a set of nodes; + * - obtain the cpumap corresponding to such node/set of nodes; + * - set (or reset, in the "^" case) in bitmap what ended up + * in such cpumap. + * + * This means that, for instance, if calling this with str="3-6", bits + * 3-6 in bitmap will be set, and it''s up to the caller to know whether + * that meant cpus 3,4,5,6 or NUMA nodes 3,4,5,6. On the other hand, if + * calling this with str="nodes:2", bits corresponding to all the cpus + * of NUMA node 2 will be the ones that are set in bitmap. + * + * This may look tricky, but it avoids quite a bit of code duplication. */ -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap) { unsigned long ida, idb; libxl_bitmap node_cpumap; @@ -571,7 +593,7 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } if (STR_HAS_PREFIX(str, "all")) { - libxl_bitmap_set_any(cpumap); + libxl_bitmap_set_any(bitmap); goto out; } @@ -595,13 +617,13 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) /* Add/Remove all the cpus in the node cpumap */ libxl_for_each_set_bit(i, node_cpumap) { - is_not ? libxl_bitmap_reset(cpumap, i) : - libxl_bitmap_set(cpumap, i); + is_not ? 
libxl_bitmap_reset(bitmap, i) : + libxl_bitmap_set(bitmap, i); } } else { /* Add/Remove this cpu */ - is_not ? libxl_bitmap_reset(cpumap, ida) : - libxl_bitmap_set(cpumap, ida); + is_not ? libxl_bitmap_reset(bitmap, ida) : + libxl_bitmap_set(bitmap, ida); } ida++; } @@ -612,18 +634,25 @@ static int update_cpumap_range(const char *str, libxl_bitmap *cpumap) } /* - * Takes a string representing a set of cpus (specified either as - * single cpus or as eintire NUMA nodes) and turns it into the - * corresponding libxl_bitmap (in cpumap). + * Takes a string representing one or more values or range of values + * separated by '','' (e.g., something like "3,2-6,8"), split it in substrings + * and, for each one of them, update bitmap to reflect that. This may result + * in bits in bitmap being both set or reset, since it is possible to append + * a not prefix ("^") to both values and ranges. + * + * It is also possible for the caller to indicate that a specific value or + * range should be treated specially, i.e., resulting in not just one bit + * (or not just the specified range of bits) to be set or reset. For details + * on that, see update_bitmap_range() above. 
*/ -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap) +static int parse_bitmap_range(char *cpu, libxl_bitmap *bitmap) { char *ptr, *saveptr = NULL; int rc = 0; for (ptr = strtok_r(cpu, ",", &saveptr); ptr; ptr = strtok_r(NULL, ",", &saveptr)) { - rc = update_cpumap_range(ptr, cpumap); + rc = update_bitmap_range(ptr, bitmap); if (rc) break; } @@ -810,7 +839,7 @@ static void parse_config_data(const char *config_source, } libxl_bitmap_set_none(&b_info->cpumap); - if (vcpupin_parse(buf2, &b_info->cpumap)) + if (parse_bitmap_range(buf2, &b_info->cpumap)) exit(1); free(buf2); @@ -4483,7 +4512,9 @@ int main_button_press(int argc, char **argv) static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, - uint32_t nr_cpus) + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { char *domname; @@ -4505,10 +4536,18 @@ static void print_vcpuinfo(uint32_t tdomid, printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); /* CPU AFFINITY */ print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout); + /* NUMA Affinity*/ + if (numa) { + printf(" / "); + print_nodemap(vcpuinfo->nodemap.map, nr_nodes, stdout); + } printf("\n"); } -static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) +static void print_domain_vcpuinfo(uint32_t domid, + uint32_t nr_cpus, + uint32_t nr_nodes, + int numa) { libxl_vcpuinfo *vcpuinfo; int i, nb_vcpu, nrcpus; @@ -4521,55 +4560,67 @@ static void print_domain_vcpuinfo(uint32_t domid, uint32_t nr_cpus) } for (i = 0; i < nb_vcpu; i++) { - print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus); + print_vcpuinfo(domid, &vcpuinfo[i], nr_cpus, nr_nodes, numa); } libxl_vcpuinfo_list_free(vcpuinfo, nb_vcpu); } -static void vcpulist(int argc, char **argv) +int main_vcpulist(int argc, char **argv) { libxl_dominfo *dominfo; libxl_physinfo physinfo; - int i, nb_domain; + int opt; + int numa = 0; + static struct option opts[] = { + {"numa", 0, 0, ''n''}, + COMMON_LONG_OPTS, + {0, 0, 0, 0} + }; + int rc = -1; + + SWITCH_FOREACH_OPT(opt, "n", opts, 
"vcpu-list", 0) { + case ''n'': + numa = 1; + break; + } if (libxl_get_physinfo(ctx, &physinfo) != 0) { fprintf(stderr, "libxl_physinfo failed.\n"); - goto vcpulist_out; + goto out; } - printf("%-32s %5s %5s %5s %5s %9s %s\n", + printf("%-32s %5s %5s %5s %5s %9s %s", "Name", "ID", "VCPU", "CPU", "State", "Time(s)", "CPU Affinity"); - if (!argc) { + if (numa) + printf(" / NUMA Affinity"); + printf("\n"); + + if (optind >= argc) { + int i, nb_domain; + if (!(dominfo = libxl_list_domain(ctx, &nb_domain))) { fprintf(stderr, "libxl_list_domain failed.\n"); - goto vcpulist_out; + goto out; } - for (i = 0; i<nb_domain; i++) - print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus); + for (i = 0; i < nb_domain; i++) + print_domain_vcpuinfo(dominfo[i].domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); libxl_dominfo_list_free(dominfo, nb_domain); } else { - for (; argc > 0; ++argv, --argc) { - uint32_t domid = find_domain(*argv); - print_domain_vcpuinfo(domid, physinfo.nr_cpus); + for (; argc > optind; ++optind) { + uint32_t domid = find_domain(argv[optind]); + print_domain_vcpuinfo(domid, physinfo.nr_cpus, + physinfo.nr_nodes, numa); } } - vcpulist_out: - libxl_physinfo_dispose(&physinfo); -} - -int main_vcpulist(int argc, char **argv) -{ - int opt; - - SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-list", 0) { - /* No options */ - } - vcpulist(argc - optind, argv + optind); - return 0; + rc = 0; + out: + libxl_physinfo_dispose(&physinfo); + return rc; } static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) @@ -4595,7 +4646,7 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) if (libxl_cpu_bitmap_alloc(ctx, &cpumap, 0)) goto out; - if (vcpupin_parse(cpu, &cpumap)) + if (parse_bitmap_range(cpu, &cpumap)) goto out; if (dryrun_only) { @@ -4646,6 +4697,69 @@ static int vcpupin(uint32_t domid, const char *vcpu, char *cpu) return rc; } +static int vcpuaff(uint32_t domid, const char *vcpu, char *node) +{ + libxl_bitmap nodemap; + uint32_t vcpuid; + char 
*endptr; + int rc = -1; + + libxl_bitmap_init(&nodemap); + + vcpuid = strtoul(vcpu, &endptr, 10); + if (vcpu == endptr) { + if (strcmp(vcpu, "all")) { + fprintf(stderr, "Error: Invalid argument\n"); + goto out; + } + vcpuid = -1; + } + + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) + goto out; + + if (parse_bitmap_range(node, &nodemap)) + goto out; + + if (dryrun_only) { + int nr_nodes = 0; + libxl_numainfo *info = libxl_get_numainfo(ctx, &nr_nodes); + + if (!info) { + fprintf(stderr, "libxl_get_numainfo failed\n"); + goto out; + } + libxl_numainfo_list_free(info, nr_nodes); + + fprintf(stdout, "nodemap: "); + print_nodemap(nodemap.map, nr_nodes, stdout); + fprintf(stdout, "\n"); + + if (ferror(stdout) || fflush(stdout)) { + perror("stdout"); + exit(-1); + } + + rc = 0; + goto out; + } + + if (vcpuid != -1) { + if (libxl_set_vcpunodeaffinity(ctx, domid, vcpuid, &nodemap)) + fprintf(stderr, "Could not set node-affinity for vcpu `%u''\n", + vcpuid); + goto out; + } + + if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap)) + fprintf(stderr, "libxl_domain_set_nodeaffinity failed\n"); + + rc = 0; + out: + libxl_bitmap_dispose(&nodemap); + return rc; +} + int main_vcpupin(int argc, char **argv) { int opt; @@ -4657,6 +4771,18 @@ int main_vcpupin(int argc, char **argv) return vcpupin(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); } +int main_vcpuaff(int argc, char **argv) +{ + int opt; + + SWITCH_FOREACH_OPT(opt, "", NULL, "vcpu-node-affinity", 3) { + /* No options */ + } + + vcpuaff(find_domain(argv[optind]), argv[optind+1] , argv[optind+2]); + return 0; +} + static void vcpuset(uint32_t domid, const char* nr_vcpus, int check_host) { char *endptr; diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c index d3dcbf0..d3dd7e6 100644 --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -208,13 +208,20 @@ struct cmd_spec cmd_table[] = { { "vcpu-list", &main_vcpulist, 0, 0, "List the VCPUs for all/some domains", - "[Domain, ...]", + 
"[option] [Domain, ...]", + "-n, --numa Show NUMA node-affinity", + }, { "vcpu-pin", &main_vcpupin, 1, 1, "Set which CPUs a VCPU can use", "<Domain> <VCPU|all> <CPUs|all>", }, + { "vcpu-node-affinity", + &main_vcpuaff, 1, 1, + "Set on which NUMA nodes a VCPU prefers to run", + "<Domain> <VCPU|all> <Nodes|all>", + }, { "vcpu-set", &main_vcpuset, 0, 1, "Set the number of active VCPUs allowed for the domain",
Dario Faggioli
2013-Nov-05 14:36 UTC
[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
in a similar way to how it is possible to specify vcpu-affinity. Manual page is updated accordingly. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- docs/man/xl.cfg.pod.5 | 70 +++++++++++++++++++++++++++++++++------ tools/libxl/libxl_dom.c | 18 ++++++---- tools/libxl/libxl_numa.c | 14 +------- tools/libxl/libxl_utils.h | 12 ++++++- tools/libxl/xl_cmdimpl.c | 80 ++++++++++++++++++++++++++++++++++++++++++++- 5 files changed, 159 insertions(+), 35 deletions(-) diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5 index 1c98cb4..1212426 100644 --- a/docs/man/xl.cfg.pod.5 +++ b/docs/man/xl.cfg.pod.5 @@ -144,18 +144,64 @@ run on cpu #3 of the host. =back If this option is not specified, no vcpu to cpu pinning is established, -and the vcpus of the guest can run on all the cpus of the host. - -If we are on a NUMA machine (i.e., if the host has more than one NUMA -node) and this option is not specified, libxl automatically tries to -place the guest on the least possible number of nodes. That, however, -will not affect vcpu pinning, so the guest will still be able to run on -all the cpus, it will just prefer the ones from the node it has been -placed on. A heuristic approach is used for choosing the best node (or -set of nodes), with the goals of maximizing performance for the guest -and, at the same time, achieving efficient utilization of host cpus -and memory. See F<docs/misc/xl-numa-placement.markdown> for more -details. +and the vcpus of the guest can run on all the cpus of the host. If this +option is specified, and no B<nodes=> option present the vcpu pinning +mask of each vcpu is utilized to compute its vcpu node-affinity, and the +union of all the vcpus node-affinities is what constitutes the domain +node-affinity (which drives memory allocations). + +=back + +=item B<nodes="NODE-LIST"> + +List of on which NUMA nodes the memory for the guest is allocated. 
This +also means (starting from Xen 4.3, and if the credit scheduler is used) +that the vcpus of the domain prefer to run on those NUMA nodes. By default, +xl (via libxl) makes a guess. A C<NODE-LIST> may be specified as follows: + +=item "all" + +To specify no particular preference and prevent xl from automatically picking +a (set of) NUMA node(s). In practice, using "all", this domain will have +no NUMA node-affinity and its memory will be spread on all the host's +NUMA nodes. + +=item "0-3,5,^1" + +To specify a NUMA node-affinity with the host NUMA nodes 0,2,3,5. +Combining this with "all" is possible, meaning "all,^7" results in the +memory being allocated on all the host NUMA nodes except node 7, as well +as trying to avoid running the domain's vcpus on the pcpus that belong to +node 7. + +=item ["1", "4"] (or [1, 4]) + +To ask for a specific vcpu to NUMA node mapping. That means (in this example) +memory will be allocated on host NUMA nodes 1 and 4 but, at the same time, +vcpu #0 of the guest prefers to run on the pcpus of host NUMA node 1, while +vcpu #1 on the pcpus of host NUMA node 4. + +=back + +If this option is not specified, xl picks a NUMA node (or a set of NUMA +nodes), according to some heuristics, and uses that as the NUMA node-affinity +for the guest. + +If we are on a NUMA machine (i.e., if the host has more than one NUMA node) +and this option is not specified, libxl automatically tries to place the +guest on the least possible number of nodes. A heuristic approach is used +for choosing the best node (or set of nodes), with the goals of maximizing +performance for the guest and, at the same time, achieving efficient +utilization of host cpus and memory. In this case, all the vcpus of the +guest will have the same vcpu node-affinity. 
+ +Notice that, independently from whether the node-affinity is specified +via this parameter, or automatically decided by libxl, that does not affect +vcpu pinning, so the guest will still be able to run on all the cpus to +which its vcpus are pinned, or all the cpus, if no B<cpus=> option is +provided. + +See F<docs/misc/xl-numa-placement.markdown> for more details. =back diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c index 1812bdc..bc4cf9a 100644 --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -215,19 +215,21 @@ int libxl__build_pre(libxl__gc *gc, uint32_t domid, } /* - * Check if the domain has any CPU affinity. If not, try to build - * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->nodemap accordingly; if it - * does not, it just leaves it as it is. This means (unless - * some weird error manifests) the subsequent call to - * libxl_domain_set_nodeaffinity() will do the actual placement, + * Check if the domain has any pinning or node-affinity and, if not, try + * to build up one. + * + * In case numa_place_domain() find at least a suitable candidate, it will + * affect info->nodemap accordingly; if it does not, it just leaves it as + * it is. This means (unless some weird error manifests) the subsequent + * call to libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. 
*/ if (libxl_defbool_val(info->numa_placement)) { - if (!libxl_bitmap_is_full(&info->cpumap)) { + if (!libxl_bitmap_is_full(&info->cpumap) || + !libxl_bitmap_is_full(&info->nodemap)) { LOG(ERROR, "Can run NUMA placement only if no vcpu " - "affinity is specified"); + "pinning or node-affinity is specified"); return ERROR_INVAL; } diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c index 20c99ac..1026579 100644 --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int vcpus_on_node[]) { libxl_dominfo *dinfo = NULL; - libxl_bitmap dom_nodemap, nodes_counted; + libxl_bitmap nodes_counted; int nr_doms, nr_cpus; int i, j, k; @@ -197,12 +197,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, return ERROR_FAIL; } - if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { - libxl_bitmap_dispose(&nodes_counted); - libxl_dominfo_list_free(dinfo, nr_doms); - return ERROR_FAIL; - } - for (i = 0; i < nr_doms; i++) { libxl_vcpuinfo *vinfo; int nr_dom_vcpus; @@ -211,9 +205,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, if (vinfo == NULL) continue; - /* Retrieve the domain''s node-affinity map */ - libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); - for (j = 0; j < nr_dom_vcpus; j++) { /* * For each vcpu of each domain, it must have both vcpu-affinity @@ -225,7 +216,7 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, int node = tinfo[k].node; if (libxl_bitmap_test(suitable_cpumap, k) && - libxl_bitmap_test(&dom_nodemap, node) && + libxl_bitmap_test(&vinfo[j].nodemap, node) && !libxl_bitmap_test(&nodes_counted, node)) { libxl_bitmap_set(&nodes_counted, node); vcpus_on_node[node]++; @@ -236,7 +227,6 @@ static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); } - libxl_bitmap_dispose(&dom_nodemap); 
libxl_bitmap_dispose(&nodes_counted); libxl_dominfo_list_free(dinfo, nr_doms); return 0; diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h index 7b84e6a..cac057c 100644 --- a/tools/libxl/libxl_utils.h +++ b/tools/libxl/libxl_utils.h @@ -90,7 +90,7 @@ static inline void libxl_bitmap_set_none(libxl_bitmap *bitmap) { memset(bitmap->map, 0, bitmap->size); } -static inline int libxl_bitmap_cpu_valid(libxl_bitmap *bitmap, int bit) +static inline int libxl_bitmap_valid(libxl_bitmap *bitmap, int bit) { return bit >= 0 && bit < (bitmap->size * 8); } @@ -125,6 +125,16 @@ static inline int libxl_node_bitmap_alloc(libxl_ctx *ctx, return libxl_bitmap_alloc(ctx, nodemap, max_nodes); } +static inline int libxl_bitmap_cpu_valid(libxl_bitmap *cpumap, int cpu) +{ + return libxl_bitmap_valid(cpumap, cpu); +} + +static inline int libxl_bitmap_node_valid(libxl_bitmap *nodemap, int node) +{ + return libxl_bitmap_valid(nodemap, node); +} + /* Populate cpumap with the cpus spanned by the nodes in nodemap */ int libxl_nodemap_to_cpumap(libxl_ctx *ctx, const libxl_bitmap *nodemap, diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c index 1659259..a035162 100644 --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -76,8 +76,9 @@ xlchild children[child_max]; static const char *common_domname; static int fd_lock = -1; -/* Stash for specific vcpu to pcpu mappping */ +/* Stash for specific vcpu to pcpu and vcpu to node mappping */ static int *vcpu_to_pcpu; +static int *vcpu_to_node; static const char savefileheader_magic[32] "Xen saved domain, xl format\n \0 \r"; @@ -670,7 +671,7 @@ static void parse_config_data(const char *config_source, const char *buf; long l; XLU_Config *config; - XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; + XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids, *vtpms; XLU_ConfigList *ioports, *irqs, *iomem; int num_ioports, num_irqs, num_iomem; int pci_power_mgmt = 0; @@ -846,6 +847,53 @@ 
static void parse_config_data(const char *config_source,
         libxl_defbool_set(&b_info->numa_placement, false);
     }
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
+        int n_cpus = 0;
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /*
+         * As above, use a temporary storage for the single vcpus'
+         * node-affinities.
+         */
+        vcpu_to_node = xmalloc(sizeof(int) * b_info->max_vcpus);
+        memset(vcpu_to_node, -1, sizeof(int) * b_info->max_vcpus);
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_cpus)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_bitmap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_bitmap_set(&b_info->nodemap, i);
+            if (n_cpus < b_info->max_vcpus)
+                vcpu_to_node[n_cpus] = i;
+            n_cpus++;
+        }
+
+        /* We have a nodemap, disable automatic placement */
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 0)) {
+        char *buf2 = strdup(buf);
+
+        if (libxl_node_bitmap_alloc(ctx, &b_info->nodemap, 0)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_bitmap_set_none(&b_info->nodemap);
+        if (parse_bitmap_range(buf2, &b_info->nodemap))
+            exit(1);
+        free(buf2);
+
+        libxl_defbool_set(&b_info->numa_placement, false);
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -2205,6 +2253,34 @@ start:
         free(vcpu_to_pcpu); vcpu_to_pcpu = NULL;
     }
+    /* And do the same for single vcpu to node-affinity mapping */
+    if (vcpu_to_node) {
+        libxl_bitmap vcpu_nodemap;
+
+        ret = libxl_node_bitmap_alloc(ctx, &vcpu_nodemap, 0);
+        if (ret)
+            goto error_out;
+        for (i = 0; i < d_config.b_info.max_vcpus; i++) {
+
+            if (vcpu_to_node[i] != -1) {
+                libxl_bitmap_set_none(&vcpu_nodemap);
+                libxl_bitmap_set(&vcpu_nodemap, vcpu_to_node[i]);
+            } else {
+                libxl_bitmap_set_any(&vcpu_nodemap);
+            }
+            if (libxl_set_vcpunodeaffinity(ctx, domid, i, &vcpu_nodemap)) {
+                fprintf(stderr, "setting node-affinity failed"
+                        " on vcpu `%d'.\n", i);
+                libxl_bitmap_dispose(&vcpu_nodemap);
+                free(vcpu_to_node);
+                ret = ERROR_FAIL;
+                goto error_out;
+            }
+        }
+        libxl_bitmap_dispose(&vcpu_nodemap);
+        free(vcpu_to_node); vcpu_to_node = NULL;
+    }
+
     ret = libxl_userdata_store(ctx, domid, "xl", config_data, config_len);
     if (ret) {
George Dunlap
2013-Nov-05 14:43 UTC
Re: [PATCH RESEND 01/12] xen: numa-sched: leave node-affinity alone if not in "auto" mode
On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> If the domain's NUMA node-affinity is being specified by the
> user/toolstack (instead of being automatically computed by Xen),
> we really should stick to that. This means domain_update_node_affinity()
> is wrong when it filters out some stuff from there even in "!auto"
> mode.
>
> This commit fixes that. Of course, this does not mean node-affinity
> is always honoured (e.g., a vcpu won't run on a pcpu of a different
> cpupool) but the necessary logic for taking into account all the
> possible situations lives in the scheduler code, where it belongs.
>
> What could happen without this change is that, under certain
> circumstances, the node-affinity of a domain may change when the
> user modifies the vcpu-affinity of the domain's vcpus. This, even
> if probably not a real bug, is at least something the user does
> not expect, so let's avoid it.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> Reviewed-by: George Dunlap <george.dunlap@eu.citrix.com>
> ---
> This has been submitted already as a single patch on its own.
> Since this series needs the change done here, just include it
> in here, instead of pinging the original submission and deferring
> posting this series.

Just reiterating what I said on the last send... this one is independent
and can be checked in whenever you're ready.
-George

> ---
>  xen/common/domain.c |   28 +++++++++-------------------
>  1 file changed, 9 insertions(+), 19 deletions(-)
>
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 5999779..af31ab4 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -352,7 +352,6 @@ void domain_update_node_affinity(struct domain *d)
>      cpumask_var_t cpumask;
>      cpumask_var_t online_affinity;
>      const cpumask_t *online;
> -    nodemask_t nodemask = NODE_MASK_NONE;
>      struct vcpu *v;
>      unsigned int node;
>
> @@ -374,28 +373,19 @@ void domain_update_node_affinity(struct domain *d)
>          cpumask_or(cpumask, cpumask, online_affinity);
>      }
>
> +    /*
> +     * If d->auto_node_affinity is true, the domain's node-affinity mask
> +     * (d->node_affinity) is automatically computed from all the domain's
> +     * vcpus' vcpu-affinity masks (the union of which we have just built
> +     * above in cpumask). OTOH, if d->auto_node_affinity is false, we
> +     * must leave the node-affinity of the domain alone.
> +     */
>      if ( d->auto_node_affinity )
>      {
> -        /* Node-affinity is automaically computed from all vcpu-affinities */
> +        nodes_clear(d->node_affinity);
>          for_each_online_node ( node )
>              if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_set(node, nodemask);
> -
> -        d->node_affinity = nodemask;
> -    }
> -    else
> -    {
> -        /* Node-affinity is provided by someone else, just filter out cpus
> -         * that are either offline or not in the affinity of any vcpus. */
> -        nodemask = d->node_affinity;
> -        for_each_node_mask ( node, d->node_affinity )
> -            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
> -                node_clear(node, nodemask);//d->node_affinity);
> -
> -        /* Avoid loosing track of node-affinity because of a bad
> -         * vcpu-affinity has been specified. */
> -        if ( !nodes_empty(nodemask) )
> -            d->node_affinity = nodemask;
> +                node_set(node, d->node_affinity);
>      }
>
>      sched_set_node_affinity(d, &d->node_affinity);
>
George Dunlap
2013-Nov-05 14:50 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
>  * ...
>
> In both domain config file and `xl vcpu-pin'.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Overall looks like a pretty clean patch; just a few comments.

> ---
> Picking this up from a previously submitted series ("xl:
> allow for node-wise specification of vcpu pinning") as the
> changes in that and in this series would otherwise be
> conflicting. If this is considered fine, feel free to apply
> it from here and skip the corresponding e-mail in the
> original submission.
> ---
> Changes from v2:
>  * turned a 'return' into 'goto out', consistently with
>    most of the exit patterns;
>  * harmonized error handling: now parse_range() returns a
>    libxl error code, as requested during review;
>  * dealing with "all" moved inside update_cpumap_range().
>    It's tricky to move it in parse_range() (as requested
>    during review), since we need the cpumap being modified
>    handy when dealing with it. However, having it in
>    update_cpumap_range() simplifies the code just as much
>    as that;
>  * explicitly checking for junk after a valid value or range
>    in parse_range(), as requested during review;
>  * xl exits on parsing failing, so no need to reset the
>    cpumap to something sensible in vcpupin_parse(), as
>    suggested during review;
>
> Changes from v1:
>  * code rearranged in order to look more simple to follow
>    and understand, as requested during review;
>  * improved docs in xl.cfg.pod.5, as requested during
>    review;
>  * strtoul() now returns into unsigned long, and the
>    case where it returns ULONG_MAX is now taken into
>    account, as requested during review;
>  * stuff like "all,^7" now works, as requested during
>    review.
>    Specifying just "^7" does not work either
>    before or after this change
>  * killed some magic (i.e., `ptr += 5 + (ptr[4] == 's'`)
>    by introducing STR_SKIP_PREFIX() macro, as requested
>    during review.
> ---
>  docs/man/xl.cfg.pod.5    |   20 ++++++
>  tools/libxl/xl_cmdimpl.c |  145 ++++++++++++++++++++++++++++++++--------------
>  2 files changed, 121 insertions(+), 44 deletions(-)
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> index d2d8921..1c98cb4 100644
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -115,7 +115,25 @@ To allow all the vcpus of the guest to run on all the cpus on the host.
>
>  =item "0-3,5,^1"
>
> -To allow all the vcpus of the guest to run on cpus 0,2,3,5.
> +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
> +this with "all" is possible, meaning "all,^7" results in all the vcpus
> +of the guest running on all the cpus on the host except cpu 7.
> +
> +=item "nodes:0-3,node:^2"

Here you use both "nodes" and "node", while the code seems to only check
for "nodes".

I was originally going to say we should just check one; but on the other
hand, it's just an extra string compare -- I feel like we might as well
accept either "node" or "nodes". (No need to enforce plurality: "nodes:2"
and "node:1-3" should both be fine with me.)

> +
> +To allow all the vcpus of the guest to run on the cpus from NUMA nodes
> +0,1,3 of the host. So, if cpus 0-3 belong to node 0, cpus 4-7 belong
> +to node 1 and cpus 8-11 to node 3, the above would mean all the vcpus
> +of the guest will run on cpus 0-3,8-11.
> +
> +Combining this notation with the one above is possible. For instance,
> +"1,node:2,^6", means all the vcpus of the guest will run on cpu 1 and
> +on all the cpus of NUMA node 2, but not on cpu 6. Following the same
> +example as above, that would be cpus 1,4,5,7.
> +
> +Combining this with "all" is also possible, meaning "all,^nodes:1"
> +results in all the vcpus of the guest running on all the cpus on the
> +host, except for the cpus belonging to the host NUMA node 1.
>
>  =item ["2", "3"] (or [2, 3])
>
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index 40feb7d..b8755b9 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -59,6 +59,11 @@
>      } \
>  })
>
> +#define STR_HAS_PREFIX( a, b ) \
> +    ( strncmp(a, b, strlen(b)) == 0 )
> +#define STR_SKIP_PREFIX( a, b ) \
> +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
> +
>
>  int logfile = 2;
>
> @@ -513,61 +518,115 @@ static void split_string_into_string_list(const char *str,
>      free(s);
>  }
>
> -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> +{
> +    char *nstr, *endptr;
> +
> +    *a = *b = strtoul(str, &endptr, 10);
> +    if (endptr == str || *a == ULONG_MAX)
> +        return ERROR_INVAL;
> +
> +    if (*endptr == '-') {
> +        nstr = endptr + 1;
> +
> +        *b = strtoul(nstr, &endptr, 10);
> +        if (endptr == nstr || *b == ULONG_MAX || *b < *a)
> +            return ERROR_INVAL;
> +    }
> +
> +    /* Valid value or range so far, but we also don't want junk after that */
> +    if (*endptr != '\0')
> +        return ERROR_INVAL;
> +
> +    return 0;
> +}
> +
> +/*
> + * Adds or removes a specific set of cpus (specified in str, either as
> + * single cpus or as entire NUMA nodes) to/from cpumap.
> + */
> +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
>  {
> -    libxl_bitmap exclude_cpumap;
> -    uint32_t cpuida, cpuidb;
> -    char *endptr, *toka, *tokb, *saveptr = NULL;
> -    int i, rc = 0, rmcpu;
> +    unsigned long ida, idb;
> +    libxl_bitmap node_cpumap;
> +    bool is_not = false, is_nodes = false;
> +    int rc = 0;
> +
> +    libxl_bitmap_init(&node_cpumap);
>
> -    if (!strcmp(cpu, "all")) {
> +    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
> +    if (rc) {
> +        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
> +        goto out;
> +    }
> +
> +    /* Are we adding or removing cpus/nodes? */
> +    if (STR_SKIP_PREFIX(str, "^")) {
> +        is_not = true;
> +    }
> +
> +    /* Are we dealing with cpus or full nodes? */
> +    if (STR_SKIP_PREFIX(str, "nodes:")) {
> +        is_nodes = true;
> +    }
> +
> +    if (STR_HAS_PREFIX(str, "all")) {

Is there any reason not to keep this "strcmp"? As it is, this will
accept any string that *starts* with "all", which isn't exactly what
you want, I don't think.

Other than that, I think it looks good.

 -George
Jan Beulich
2013-Nov-05 14:52 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> @@ -197,6 +199,13 @@ struct vcpu
>      /* Used to restore affinity across S3. */
>      cpumask_var_t cpu_affinity_saved;
>
> +    /*
> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> +     * and auto_node_affinity access is serialized against
> +     * v->domain->node_affinity_lock.
> +     */
> +    cpumask_var_t node_affinity;

This all looks quite sensible, except for the naming here: We
already have a node_affinity field in struct domain, having a
meaning that one can expect with this name. So you break
both consistency and the rule of least surprise here. How
about just "preferred_cpus"?

Jan
George Dunlap
2013-Nov-05 15:03 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>> @@ -197,6 +199,13 @@ struct vcpu
>>      /* Used to restore affinity across S3. */
>>      cpumask_var_t cpu_affinity_saved;
>>
>> +    /*
>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>> +     * and auto_node_affinity access is serialized against
>> +     * v->domain->node_affinity_lock.
>> +     */
>> +    cpumask_var_t node_affinity;
>
> This all looks quite sensible, except for the naming here: We
> already have a node_affinity field in struct domain, having a
> meaning that one can expect with this name. So you break
> both consistency and the rule of least surprise here. How
> about just "preferred_cpus"?

Actually, would it make more sense to remove node_affinity from the
domain struct, and have the tools manually set the node_affinity for the
various vcpus if the user attempts to set the "domain numa affinity"?

-George
Jan Beulich
2013-Nov-05 15:11 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> @@ -197,6 +199,13 @@ struct vcpu
>>>      /* Used to restore affinity across S3. */
>>>      cpumask_var_t cpu_affinity_saved;
>>>
>>> +    /*
>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>> +     * and auto_node_affinity access is serialized against
>>> +     * v->domain->node_affinity_lock.
>>> +     */
>>> +    cpumask_var_t node_affinity;
>>
>> This all looks quite sensible, except for the naming here: We
>> already have a node_affinity field in struct domain, having a
>> meaning that one can expect with this name. So you break
>> both consistency and the rule of least surprise here. How
>> about just "preferred_cpus"?
>
> Actually, would it make more sense to remove node_affinity from the
> domain struct, and have the tools manually set the node_affinity for the
> various vcpus if the user attempts to set the "domain numa affinity"?

Quite likely, yes. But that doesn't mean that the name can be kept
as is.

Jan
George Dunlap
2013-Nov-05 15:11 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:03 PM, George Dunlap wrote:
> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com>
>>>>> wrote:
>>> @@ -197,6 +199,13 @@ struct vcpu
>>>      /* Used to restore affinity across S3. */
>>>      cpumask_var_t cpu_affinity_saved;
>>>
>>> +    /*
>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>> +     * and auto_node_affinity access is serialized against
>>> +     * v->domain->node_affinity_lock.
>>> +     */
>>> +    cpumask_var_t node_affinity;
>>
>> This all looks quite sensible, except for the naming here: We
>> already have a node_affinity field in struct domain, having a
>> meaning that one can expect with this name. So you break
>> both consistency and the rule of least surprise here. How
>> about just "preferred_cpus"?
>
> Actually, would it make more sense to remove node_affinity from the
> domain struct, and have the tools manually set the node_affinity for the
> various vcpus if the user attempts to set the "domain numa affinity"?

Sorry, speaking before I had thought it through. Of course we need a
numa affinity for the domain for allocating memory.

How about "cpu_node_affinity"?

Or, we could internally change the names to "cpu_hard_affinity" and
"cpu_soft_affinity", since that's effectively what the scheduler will
do. It's possible someone might want to set soft affinities for some
other reason besides NUMA performance.

-George
Jan Beulich
2013-Nov-05 15:23 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do. It's possible someone might want to set soft affinities for some
> other reason besides NUMA performance.

I like that.

Jan
George Dunlap
2013-Nov-05 15:24 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:11 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 11/05/2013 02:52 PM, Jan Beulich wrote:
>>>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>>> @@ -197,6 +199,13 @@ struct vcpu
>>>>      /* Used to restore affinity across S3. */
>>>>      cpumask_var_t cpu_affinity_saved;
>>>>
>>>> +    /*
>>>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
>>>> +     * and auto_node_affinity access is serialized against
>>>> +     * v->domain->node_affinity_lock.
>>>> +     */
>>>> +    cpumask_var_t node_affinity;
>>>
>>> This all looks quite sensible, except for the naming here: We
>>> already have a node_affinity field in struct domain, having a
>>> meaning that one can expect with this name. So you break
>>> both consistency and the rule of least surprise here. How
>>> about just "preferred_cpus"?
>>
>> Actually, would it make more sense to remove node_affinity from the
>> domain struct, and have the tools manually set the node_affinity for the
>> various vcpus if the user attempts to set the "domain numa affinity"?
>
> Quite likely, yes. But that doesn't mean that the name can be kept
> as is.

Right, I see what you're saying -- in the domain struct, node_affinity
is a mask of nodes, but here it's being used as a mask of cpus. Yes,
that needs to change.

-George
George Dunlap
2013-Nov-05 15:39 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> Or, we could internally change the names to "cpu_hard_affinity" and
>> "cpu_soft_affinity", since that's effectively what the scheduler will
>> do. It's possible someone might want to set soft affinities for some
>> other reason besides NUMA performance.
>
> I like that.

A potential problem with that is the "auto domain numa" thing. In this
patch, if the domain numa affinity is not set but vcpu numa affinity is,
the domain numa affinity (which will be used to allocate memory for the
domain) will be set based on the vcpu numa affinity. That seems like a
useful feature (though perhaps it's starting to violate the "policy
should be in the tools" principle). If we change this to just "hard
affinity" and "soft affinity", we'll lose the natural logical connection
there. It might have impacts on how we end up doing vNUMA as well.

So I'm a bit torn ATM. Dario, any thoughts?

-George
George Dunlap
2013-Nov-05 16:20 UTC
Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> instead of relying on the domain-wide node-affinity. To achieve
> that, we use the specific vcpu's node-affinity when computing
> the NUMA load balancing mask in csched_balance_cpumask().
>
> As a side effect, this simplifies the code quite a bit. In
> fact, prior to this change, we needed to cache the translation
> of d->node_affinity (which is a nodemask_t) to a cpumask_t,
> since that is what is needed during the actual scheduling
> (we used to keep it in node_affinity_cpumask).
>
> Since the per-vcpu node-affinity is maintained in a cpumask_t
> field (v->node_affinity) already, we don't need that complicated
> updating logic in place any longer, and this change hence can
> remove stuff like sched_set_node_affinity(),
> csched_set_node_affinity() and, of course, node_affinity_cpumask
> from csched_dom.
>
> The high level description of NUMA placement and scheduling in
> docs/misc/xl-numa-placement.markdown is updated too, to match
> the new behavior. While at it, an attempt is made to make such
> document as unambiguous as possible, with respect to the
> concepts of vCPU pinning, domain node-affinity and vCPU
> node-affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  docs/misc/xl-numa-placement.markdown |  124 +++++++++++++++++++++++-----------
>  xen/common/domain.c                  |    2 -
>  xen/common/sched_credit.c            |   62 ++---------------
>  xen/common/schedule.c                |    5 -
>  xen/include/xen/sched-if.h           |    2 -
>  xen/include/xen/sched.h              |    1
>  6 files changed, 94 insertions(+), 102 deletions(-)
>
> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
> index caa3fec..890b856 100644
> --- a/docs/misc/xl-numa-placement.markdown
> +++ b/docs/misc/xl-numa-placement.markdown
> @@ -12,13 +12,6 @@ is quite more complex and slow.
On these machines, a NUMA node is usually
> defined as a set of processor cores (typically a physical CPU package) and
> the memory directly attached to the set of cores.
>
> -The Xen hypervisor deals with NUMA machines by assigning to each domain
> -a "node affinity", i.e., a set of NUMA nodes of the host from which they
> -get their memory allocated. Also, even if the node affinity of a domain
> -is allowed to change on-line, it is very important to "place" the domain
> -correctly when it is fist created, as the most of its memory is allocated
> -at that time and can not (for now) be moved easily.
> -
>  NUMA awareness becomes very important as soon as many domains start
>  running memory-intensive workloads on a shared host. In fact, the cost
>  of accessing non node-local memory locations is very high, and the
> @@ -27,13 +20,37 @@ performance degradation is likely to be noticeable.
>  For more information, have a look at the [Xen NUMA Introduction][numa_intro]
>  page on the Wiki.
>
> +## Xen and NUMA machines: the concept of _node-affinity_ ##
> +
> +The Xen hypervisor deals with NUMA machines through the concept of
> +_node-affinity_. When talking about node-affinity, it is important to
> +distinguish two different situations. The node-affinity *of a domain* is
> +the set of NUMA nodes of the host where the memory for the domain is
> +being allocated (mostly, at domain creation time). The node-affinity *of
> +a vCPU* is the set of NUMA nodes of the host where the vCPU prefers to
> +run, and the (credit1 only, for now) scheduler will try to accomplish
> +that, whenever it is possible.
> +
> +Of course, despite the fact that they belong to and affect different
> +subsystems, domain and vCPU node-affinities are related. In fact, the
> +node-affinity of a domain is the union of the node-affinities of all the
> +domain's vCPUs.
> +
> +The above means that, when changing the vCPU node-affinity, the domain
> +node-affinity also changes.
Well, although this is allowed to happen
> +on-line (i.e., when a domain is already running), that will not result
> +in the memory that has already been allocated being moved to a different
> +host NUMA node. This is why it is very important to "place" the domain
> +correctly when it is first created, as most of its memory is allocated
> +at that time and can not (for now) be moved easily.
> +
>  ### Placing via pinning and cpupools ###
>
>  The simplest way of placing a domain on a NUMA node is statically pinning
> -the domain's vCPUs to the pCPUs of the node. This goes under the name of
> -CPU affinity and can be set through the "cpus=" option in the config file
> -(more about this below). Another option is to pool together the pCPUs
> -spanning the node and put the domain in such a cpupool with the "pool="
> +the domain's vCPUs to the pCPUs of the node. This also goes under the name
> +of vCPU-affinity and can be set through the "cpus=" option in the config
> +file (more about this below). Another option is to pool together the pCPUs
> +spanning the node and put the domain in such a _cpupool_ with the "pool="
>  config option (as documented in our [Wiki][cpupools_howto]).
>
>  In both the above cases, the domain will not be able to execute outside
> @@ -45,24 +62,45 @@ may come at he cost of some load imbalances.
>
>  ### NUMA aware scheduling ###
>
> -If the credit scheduler is in use, the concept of node affinity defined
> -above does not only apply to memory. In fact, starting from Xen 4.3, the
> -scheduler always tries to run the domain's vCPUs on one of the nodes in
> -its node affinity. Only if that turns out to be impossible, it will just
> -pick any free pCPU.
> +If using the credit scheduler, and starting from Xen 4.3, the scheduler always
> +tries to run the domain's vCPUs on one of the nodes in their node-affinity.
> +Only if that turns out to be impossible, it will just pick any free pCPU.
> +Moreover, starting from Xen 4.4, each vCPU can have its own node-affinity,
> +potentially different from the ones of all the other vCPUs of the domain.
>
> -This is, therefore, something more flexible than CPU affinity, as a domain
> -can still run everywhere, it just prefers some nodes rather than others.
> +This is, therefore, something more flexible than vCPU pinning, as vCPUs
> +can still run everywhere, they just prefer some nodes rather than others.
>  Locality of access is less guaranteed than in the pinning case, but that
>  comes along with better chances to exploit all the host resources (e.g.,
>  the pCPUs).
>
> -In fact, if all the pCPUs in a domain's node affinity are busy, it is
> -possible for the domain to run outside of there, but it is very likely that
> -slower execution (due to remote memory accesses) is still better than no
> -execution at all, as it would happen with pinning. For this reason, NUMA
> -aware scheduling has the potential of bringing substantial performances
> -benefits, although this will depend on the workload.
> +In fact, if all the pCPUs in a vCPU's node-affinity are busy, it is
> +possible for the domain to run outside of there. The idea is that
> +slower execution (due to remote memory accesses) is still better than
> +no execution at all (as it would happen with pinning). For this reason,
> +NUMA aware scheduling has the potential of bringing substantial
> +performance benefits, although this will depend on the workload.
> +
> +Notice that, for each vCPU, the following three scenarios are possible:
> +
> + * a vCPU *is pinned* to some pCPUs and *does not have* any vCPU
> +   node-affinity. In this case, the vCPU is always scheduled on one
> +   of the pCPUs to which it is pinned, without any specific preference
> +   on which one of them.
Internally, the vCPU's node-affinity is just
> +   automatically computed from the vCPU pinning, and the scheduler
> +   just ignores it;
> + * a vCPU *has* its own vCPU node-affinity and *is not* pinned to
> +   any particular pCPU. In this case, the vCPU can run on every pCPU.
> +   Nevertheless, the scheduler will try to have it running on one of
> +   the pCPUs of the node(s) it has node-affinity with;
> + * a vCPU *has* its own vCPU node-affinity and *is also* pinned to
> +   some pCPUs. In this case, the vCPU is always scheduled on one of the
> +   pCPUs to which it is pinned, with, among them, a preference for the
> +   ones that are from the node(s) it has node-affinity with. In case
> +   pinning and node-affinity form two disjoint sets of pCPUs, pinning
> +   "wins", and the node-affinity, although it is still used to derive
> +   the domain's node-affinity (for memory allocation), is, from the
> +   scheduler's perspective, just ignored.
>
>  ## Guest placement in xl ##
>
> @@ -71,25 +109,23 @@ both manual or automatic placement of them across the host's NUMA nodes.
>
>  Note that xm/xend does a very similar thing, the only differences being
>  the details of the heuristics adopted for automatic placement (see below),
> -and the lack of support (in both xm/xend and the Xen versions where that\
> +and the lack of support (in both xm/xend and the Xen versions where that
>  was the default toolstack) for NUMA aware scheduling.
>
>  ### Placing the guest manually ###
>
>  Thanks to the "cpus=" option, it is possible to specify where a domain
>  should be created and scheduled on, directly in its config file. This
> -affects NUMA placement and memory accesses as the hypervisor constructs
> -the node affinity of a VM basing right on its CPU affinity when it is
> -created.
> +affects NUMA placement and memory accesses as, in this case, the hypervisor
> +constructs the node-affinity of a VM based right on its vCPU pinning
> +when it is created.
>
>  This is very simple and effective, but requires the user/system
> -administrator to explicitly specify affinities for each and every domain,
> +administrator to explicitly specify the pinning of each and every domain,
>  or Xen won't be able to guarantee the locality for their memory accesses.
>
> -Notice that this also pins the domain's vCPUs to the specified set of
> -pCPUs, so it not only sets the domain's node affinity (its memory will
> -come from the nodes to which the pCPUs belong), but at the same time
> -forces the vCPUs of the domain to be scheduled on those same pCPUs.
> +That, of course, also means the vCPUs of the domain will only be able to
> +execute on those same pCPUs.
>
>  ### Placing the guest automatically ###
>
> @@ -97,7 +133,10 @@ If no "cpus=" option is specified in the config file, libxl tries
>  to figure out on its own on which node(s) the domain could fit best.
>  If it finds one (some), the domain's node affinity get set to there,
>  and both memory allocations and NUMA aware scheduling (for the credit
> -scheduler and starting from Xen 4.3) will comply with it.
> +scheduler and starting from Xen 4.3) will comply with it. Starting from
> +Xen 4.4, this just means all the vCPUs of the domain will have the same
> +vCPU node-affinity, that is the outcome of such automatic "fitting"
> +procedure.
>
>  It is worthwhile noting that optimally fitting a set of VMs on the NUMA
>  nodes of an host is an incarnation of the Bin Packing Problem. In fact,
> @@ -143,22 +182,29 @@ any placement from happening:
>
>      libxl_defbool_set(&domain_build_info->numa_placement, false);
>
>  Also, if `numa_placement` is set to `true`, the domain must not
> -have any CPU affinity (i.e., `domain_build_info->cpumap` must
> -have all its bits set, as it is by default), or domain creation
> -will fail returning `ERROR_INVAL`.
> +have any vCPU pinning (i.e., `domain_build_info->cpumap` must have
> +all its bits set, as it is by default), or domain creation will fail
> +with an `ERROR_INVAL`.
>
>  Starting from Xen 4.3, in case automatic placement happens (and is
>  successful), it will affect the domain's node affinity and _not_ its
> -CPU affinity. Namely, the domain's vCPUs will not be pinned to any
> +vCPU pinning. Namely, the domain's vCPUs will not be pinned to any
>  pCPU on the host, but the memory from the domain will come from the
>  selected node(s) and the NUMA aware scheduling (if the credit scheduler
> -is in use) will try to keep the domain there as much as possible.
> +is in use) will try to keep the domain's vCPUs there as much as possible.
>
>  Besides than that, looking and/or tweaking the placement algorithm
>  search "Automatic NUMA placement" in libxl\_internal.h.
>
>  Note this may change in future versions of Xen/libxl.
>
> +## Xen < 4.4 ##
> +
> +The concept of per-vCPU node-affinity has been introduced for the first
> +time in Xen 4.4. In Xen versions earlier than that, the node-affinity is
> +the same for the whole domain, that is to say the same for all the vCPUs
> +of the domain.
> +
> ## Xen < 4.3 ##
>
> As NUMA aware scheduling is a new feature of Xen 4.3, things are a little
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> index 366d9b9..ae29945 100644
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -410,8 +410,6 @@ void domain_update_node_affinity(struct domain *d)
>      for_each_cpu ( cpu, cpumask )
>          node_set(cpu_to_node(cpu), d->node_affinity);
>
> -    sched_set_node_affinity(d, &d->node_affinity);
> -
>      spin_unlock(&d->node_affinity_lock);
>
>      free_cpumask_var(online_affinity);
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> index c53a36b..d228127 100644
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
> @@ -178,9 +178,6 @@ struct csched_dom {
>      struct list_head active_vcpu;
>      struct list_head active_sdom_elem;
>      struct domain *dom;
> -    /* cpumask translated from the domain's node-affinity.
> -     * Basically, the CPUs we prefer to be scheduled on. */
> -    cpumask_var_t node_affinity_cpumask;
>      uint16_t active_vcpu_count;
>      uint16_t weight;
>      uint16_t cap;
> @@ -261,32 +258,6 @@ __runq_remove(struct csched_vcpu *svc)
>      list_del_init(&svc->runq_elem);
>  }
>
> -/*
> - * Translates node-affinity mask into a cpumask, so that we can use it during
> - * actual scheduling. That of course will contain all the cpus from all the
> - * set nodes in the original node-affinity mask.
> - *
> - * Note that any serialization needed to access mask safely is complete
> - * responsibility of the caller of this function/hook.
> - */
> -static void csched_set_node_affinity(
> -    const struct scheduler *ops,
> -    struct domain *d,
> -    nodemask_t *mask)
> -{
> -    struct csched_dom *sdom;
> -    int node;
> -
> -    /* Skip idle domain since it doesn't even have a node_affinity_cpumask */
> -    if ( unlikely(is_idle_domain(d)) )
> -        return;
> -
> -    sdom = CSCHED_DOM(d);
> -    cpumask_clear(sdom->node_affinity_cpumask);
> -    for_each_node_mask( node, *mask )
> -        cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask,
> -                   &node_to_cpumask(node));
> -}
>
>  #define for_each_csched_balance_step(step) \
>      for ( (step) = 0; (step) <= CSCHED_BALANCE_CPU_AFFINITY; (step)++ )
> @@ -294,12 +265,12 @@ static void csched_set_node_affinity(
>
>  /*
>   * vcpu-affinity balancing is always necessary and must never be skipped.
> - * OTOH, if a domain's node-affinity is said to be automatically computed
> - * (or if it just spans all the nodes), we can safely avoid dealing with
> - * node-affinity entirely.
> + * OTOH, if the vcpu's numa-affinity is being automatically computed out of
> + * the vcpu's vcpu-affinity, or if it just spans all the nodes, we can
> + * safely avoid dealing with numa-affinity entirely.
>   *
> - * Node-affinity is also deemed meaningless in case it has empty
> - * intersection with mask, to cover the cases where using the node-affinity
> + * A vcpu's numa-affinity is also deemed meaningless in case it has empty
> + * intersection with mask, to cover the cases where using the numa-affinity
>   * mask seems legit, but would instead led to trying to schedule the vcpu
>   * on _no_ pcpu!
Typical use cases are for mask to be equal to the vcpu's
>   * vcpu-affinity, or to the && of vcpu-affinity and the set of online cpus
> @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
>  static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
>                                             const cpumask_t *mask)
>  {
> -    const struct domain *d = vc->domain;
> -    const struct csched_dom *sdom = CSCHED_DOM(d);
> -
> -    if (cpumask_full(sdom->node_affinity_cpumask)
> -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> +    if ( vc->auto_node_affinity == 1
> +         || cpumask_full(vc->node_affinity)
> +         || !cpumask_intersects(vc->node_affinity, mask) )

Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this point?

Other than that, I think this whole patch looks good -- nice when you can add a feature and simplify the code. :-)

 -George
George Dunlap
2013-Nov-05 16:56 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 03:39 PM, George Dunlap wrote:
> On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com>
>>>>> wrote:
>>> Or, we could internally change the names to "cpu_hard_affinity" and
>>> "cpu_soft_affinity", since that's effectively what the scheduler will
>>> do. It's possible someone might want to set soft affinities for some
>>> other reason besides NUMA performance.
>>
>> I like that.
>
> A potential problem with that is the "auto domain numa" thing. In this
> patch, if the domain numa affinity is not set but vcpu numa affinity is,
> the domain numa affinity (which will be used to allocate memory for the
> domain) will be set based on the vcpu numa affinity. That seems like a
> useful feature (though perhaps it's starting to violate the "policy
> should be in the tools" principle). If we change this to just "hard
> affinity" and "soft affinity", we'll lose the natural logical connection
> there. It might have impacts on how we end up doing vNUMA as well. So
> I'm a bit torn ATM.
>
> Dario, any thoughts?

[Coming back after going through the whole series]

This is basically the main architectural question that needs to be sorted out with the series: Do we bake in that the "soft affinity" is specifically for NUMA-ness, or not?

The patch the way it is does make this connection, and that has several implications:
* There is no more concept of a separate "domain numa affinity" (Patch 06); the domain numa affinity is just a pre-calculated union of the vcpu affinities.
* The interface to this "soft affinity" is a bitmask of numa nodes, not a bitmask of cpus.

If we're OK with that direction, then I think this patch series looks pretty good.

Release-wise, I think as long as we're OK with libxl providing a "set_vcpu_numa_affinity", then we can always come back and change the implementation later if we want to maintain that distinction internally.

 -George
George Dunlap
2013-Nov-05 17:16 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 04:56 PM, George Dunlap wrote:
> On 11/05/2013 03:39 PM, George Dunlap wrote:
>> On 11/05/2013 03:23 PM, Jan Beulich wrote:
>>>>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com>
>>>>>> wrote:
>>>> Or, we could internally change the names to "cpu_hard_affinity" and
>>>> "cpu_soft_affinity", since that's effectively what the scheduler will
>>>> do. It's possible someone might want to set soft affinities for some
>>>> other reason besides NUMA performance.
>>>
>>> I like that.
>>
>> A potential problem with that is the "auto domain numa" thing. In this
>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>> the domain numa affinity (which will be used to allocate memory for the
>> domain) will be set based on the vcpu numa affinity. That seems like a
>> useful feature (though perhaps it's starting to violate the "policy
>> should be in the tools" principle). If we change this to just "hard
>> affinity" and "soft affinity", we'll lose the natural logical connection
>> there. It might have impacts on how we end up doing vNUMA as well. So
>> I'm a bit torn ATM.
>>
>> Dario, any thoughts?
>
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?
>
> The patch the way it is does make this connection, and that has several
> implications:
> * There is no more concept of a separate "domain numa affinity" (Patch
> 06); the domain numa affinity is just a pre-calculated union of the vcpu
> affinities.
> * The interface to this "soft affinity" is a bitmask of numa nodes, not
> a bitmask of cpus.
>
> If we're OK with that direction, then I think this patch series looks
> pretty good.

Just to outline what the alternative would look like: The hypervisor would focus on the minimum mechanisms required to do something useful for NUMA systems. The domain NUMA affinity would be only used for memory allocation. vcpus would only have "hard" and "soft" affinities. The toolstack (libxl? xl?) would be responsible for stitching these together into a useable interface for NUMA: e.g., it would have the concept of "numa affinity" for vcpus (or indeed, virtual NUMA topologies), and would do things like update the domain NUMA affinity based on vcpu affinities.

This would mean the toolstack either assuming, when someone calls vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping its own copy of numa_affinity for each vcpu around somewhere.

Alternately, we could punt on the NUMA interface altogether for this patch series, and wait until we can implement a full-featured vNUMA interface. That is, for this patch series, make an interface just do NUMA affinity for memory, and "soft" and "hard" affinities for vcpus. Then in another series (perhaps one shortly after that), implement a full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes mapped to pNUMA nodes -- with the toolstack implementing all of this just using the soft affinities.

Thoughts?

 -George
Jan Beulich
2013-Nov-05 17:24 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 17:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 11/05/2013 03:39 PM, George Dunlap wrote:
>> A potential problem with that is the "auto domain numa" thing. In this
>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>> the domain numa affinity (which will be used to allocate memory for the
>> domain) will be set based on the vcpu numa affinity. That seems like a
>> useful feature (though perhaps it's starting to violate the "policy
>> should be in the tools" principle). If we change this to just "hard
>> affinity" and "soft affinity", we'll lose the natural logical connection
>> there. It might have impacts on how we end up doing vNUMA as well. So
>> I'm a bit torn ATM.
>
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?

I think that it would be beneficial if we could keep the soft affinity as a more abstract construct than just representing NUMAness, even if that means that domain node affinity won't be able to be fully "automated" anymore. But I'm saying that without having a specific use case in mind.

Jan
Jan Beulich
2013-Nov-05 17:30 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 05.11.13 at 18:16, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

Question is how reasonable it is for the tool stack to keep such information around and up-to-date, in particular considering node/CPU/memory hotplug. If that's reasonable, then your proposal below seems to be the right one architecturally, i.e. keeping policy entirely at the tools level.

Jan

> Alternately, we could punt on the NUMA interface altogether for this
> patch series, and wait until we can implement a full-featured vNUMA
> interface. That is, for this patch series, make an interface just do
> NUMA affinity for memory, and "soft" and "hard" affinities for vcpus.
> Then in another series (perhaps one shortly after that), implement a
> full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes
> mapped to pNUMA nodes -- with the toolstack implementing all of this
> just using the soft affinities.
>
> Thoughts?
>
> -George
George Dunlap
2013-Nov-05 17:31 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 11/05/2013 05:24 PM, Jan Beulich wrote:
>>>> On 05.11.13 at 17:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 11/05/2013 03:39 PM, George Dunlap wrote:
>>> A potential problem with that is the "auto domain numa" thing. In this
>>> patch, if the domain numa affinity is not set but vcpu numa affinity is,
>>> the domain numa affinity (which will be used to allocate memory for the
>>> domain) will be set based on the vcpu numa affinity. That seems like a
>>> useful feature (though perhaps it's starting to violate the "policy
>>> should be in the tools" principle). If we change this to just "hard
>>> affinity" and "soft affinity", we'll lose the natural logical connection
>>> there. It might have impacts on how we end up doing vNUMA as well. So
>>> I'm a bit torn ATM.
>>
>> [Coming back after going through the whole series]
>>
>> This is basically the main architectural question that needs to be
>> sorted out with the series: Do we bake in that the "soft affinity" is
>> specifically for NUMA-ness, or not?
>
> I think that it would be beneficial if we could keep the soft
> affinity as a more abstract construct than just representing
> NUMAness, even if that means that domain node affinity
> won't be able to be fully "automated" anymore. But I'm
> saying that without having a specific use case in mind.

People seem to have found all kinds of fun use cases for the "hard" pinning we have; it's just a handy tool to have around. Allowing an arbitrary "soft" pinning just seems like a similarly handy tool. I'm sure between our users and our downstreams (Suse, Oracle, XenServer, QubesOS, &c &c) someone will find a use for it. :-)

 -George
Dario Faggioli
2013-Nov-05 22:15 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, Jan Beulich wrote:
> >>> On 05.11.13 at 16:03, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > On 11/05/2013 02:52 PM, Jan Beulich wrote:
> >>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> >>> @@ -197,6 +199,13 @@ struct vcpu
> >>>      /* Used to restore affinity across S3. */
> >>>      cpumask_var_t cpu_affinity_saved;
> >>>
> >>> +    /*
> >>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> >>> +     * and auto_node_affinity access is serialized against
> >>> +     * v->domain->node_affinity_lock.
> >>> +     */
> >>> +    cpumask_var_t node_affinity;
> >>
> >> This all looks quite sensible, except for the naming here: We
> >> already have a node_affinity field in struct domain, having a
> >> meaning that one can expect with this name. So you break
> >> both consistency and the rule of least surprise here. How
> >> about just "preferred_cpus"?
> >
> > Actually, would it make more sense to remove node_affinity from the
> > domain struct, and have the tools manually set the node_affinity for the
> > various vcpus if the user attempts to set the "domain numa affinity"?
>
> Quite likely, yes. But that doesn't mean that the name can be kept
> as is.

Not really. Well, it sure is possible, but it's a bit more involved than it sounds, as the node_affinity field in struct domain drives all the actual memory allocations, and it's maintained in a nodemask_t as that's easier to use in that circumstance.

Getting rid of it would mean going through the node_affinity of all the vcpus and building such a node mask every time (or something like that), which I think is both more code and less performance! :-P

(BTW, seeing the rest of the e-mails, I think you ended up realizing that, but still...)

So, as I'll say replying to the other messages, I'm fine with changing names, much less with removing d->node_affinity.
Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-05 22:22 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, George Dunlap wrote:
> On 11/05/2013 03:03 PM, George Dunlap wrote:
> > On 11/05/2013 02:52 PM, Jan Beulich wrote:
> >>>>> On 05.11.13 at 15:35, Dario Faggioli <dario.faggioli@citrix.com>
> >>>>> wrote:
> >>> @@ -197,6 +199,13 @@ struct vcpu
> >>>      /* Used to restore affinity across S3. */
> >>>      cpumask_var_t cpu_affinity_saved;
> >>>
> >>> +    /*
> >>> +     * Bitmask of CPUs on which this VCPU prefers to run. For both this
> >>> +     * and auto_node_affinity access is serialized against
> >>> +     * v->domain->node_affinity_lock.
> >>> +     */
> >>> +    cpumask_var_t node_affinity;
> >>
> >> This all looks quite sensible, except for the naming here: We
> >> already have a node_affinity field in struct domain, having a
> >> meaning that one can expect with this name. So you break
> >> both consistency and the rule of least surprise here. How
> >> about just "preferred_cpus"?
> >
> > Actually, would it make more sense to remove node_affinity from the
> > domain struct, and have the tools manually set the node_affinity for the
> > various vcpus if the user attempts to set the "domain numa affinity"?
>
> Sorry, speaking before I had thought it through. Of course we need a
> numa affinity for the domain for allocating memory.

Ah! And I'm also sorry I missed this one before replying! :-P

> How about, "cpu_node_affinity"?
>
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do. It's possible someone might want to set soft affinities for some
> other reason besides NUMA performance.

That sounds reasonable to me, as long as we keep in mind that the upper layers would continue to call "hard affinity" either "vcpu affinity" (domctl, libxc and libxl) or "vcpu pinning" (xl, where the command name is 'vcpu-pin').
For the "soft affinity", I think the term "node affinity" or "NUMA node affinity" has already spread a little bit around the codebase, so even if there is no real libxc or libxl interface for that before this patch, we may have to live with that too.

Is this acceptable?

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 22:54 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:39 +0000, George Dunlap wrote:
> On 11/05/2013 03:23 PM, Jan Beulich wrote:
> >>>> On 05.11.13 at 16:11, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> >> Or, we could internally change the names to "cpu_hard_affinity" and
> >> "cpu_soft_affinity", since that's effectively what the scheduler will
> >> do. It's possible someone might want to set soft affinities for some
> >> other reason besides NUMA performance.
> >
> > I like that.

I like that too.

> A potential problem with that is the "auto domain numa" thing. In this
> patch, if the domain numa affinity is not set but vcpu numa affinity is,
> the domain numa affinity (which will be used to allocate memory for the
> domain) will be set based on the vcpu numa affinity.

That would indeed be an issue but, hopefully, even if we want to go the soft!=NUMA way, it just means that, at the toolstack level, two calls instead of one are required to properly place a guest on one/some NUMA node(s): one for the vcpus (setting the soft affinity) and one for the memory (setting the node affinity, which would then affect memory only).

Really, let's see if I succeed in coding it up tomorrow morning, but I don't expect too serious issues. The only thing that puzzles me is what the 'default behaviour' should be. I mean, if in a VM config file I find the following:

    cpus = "2,9"

And nothing else, related to hard, soft or memory affinity, what do I do?

I'd say, even if only for consistency with what we're doing right now (even without this patch), we should go all the way through both the calls above, and set both hard affinity and memory node affinity. If not, people doing this in 4.4 (or 4.5, if we miss the train) would experience different behaviour wrt the current one, which would be bad!

Thoughts?
Similar question: provided I implement something like libxl_set_vcpu_hardaffinity(), libxl_set_vcpu_softaffinity() and libxl_set_mem_nodeaffinity(), what should a call to the existing libxl_set_vcpuaffinity() do? As above, right now it affects both vcpu-pinning and node-affinity, so I'd say it should call both libxl_set_vcpu_hardaffinity() and libxl_set_mem_nodeaffinity(). Is that ok?

BTW, does the interface sketched above look reasonable? Someone else up for suggesting more fancy names? :-)

> That seems like a
> useful feature (though perhaps it's starting to violate the "policy
> should be in the tools" principle).

The current situation being that d->node_affinity is set based on vcpu-pinning, having it retain the same meaning and behaviour, apart from being set based on vcpu node-affinity, looked like the natural extension of it to me. However...

> If we change this to just "hard
> affinity" and "soft affinity", we'll lose the natural logical connection
> there.

...I think I see what you mean, and I agree that it wasn't that much of a violation if we stick to soft==NUMA (as implemented in the series). If we go for your proposal (which, again, I like), it'd indeed become hard to decide whether node affinity should derive from hard or soft affinity. :-)

> It might have impacts on how we end up doing vNUMA as well. So
> I'm a bit torn ATM.
>
> Dario, any thoughts?

As said above, I like the idea of separating soft affinities and NUMA-ness... I'll give it a try and either submit an updated series implementing such behaviour, or follow up on this conversation with any issues I find while doing so.
:-)

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:01 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:16 +0000, George Dunlap wrote:
> Just to outline what the alternative would look like: The hypervisor
> would focus on the minimum mechanisms required to do something useful
> for NUMA systems. The domain NUMA affinity would be only used for
> memory allocation. vcpus would only have "hard" and "soft" affinities.
> The toolstack (libxl? xl?) would be responsible for stitching these
> together into a useable interface for NUMA: e.g., it would have the
> concept of "numa affinity" for vcpus (or indeed, virtual NUMA
> topologies), and would do things like update the domain NUMA affinity
> based on vcpu affinities.

Yep, I confirm that I like it, and that I don't think this should be too difficult to implement. The hardest thing is going to be picking sensible names for the libxl functions and xl config options (and no smiley here, because I'm not joking!)

> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

Right.

> Alternately, we could punt on the NUMA interface altogether for this
> patch series, and wait until we can implement a full-featured vNUMA
> interface.

Well, yes, and let's fall back to that if implementing the above turns out to be too difficult or too long, ok?

> That is, for this patch series, make an interface just do
> NUMA affinity for memory, and "soft" and "hard" affinities for vcpus.
> Then in another series (perhaps one shortly after that), implement a
> full vNUMA interface, with vcpus mapped to vNUMA nodes, and vNUMA nodes
> mapped to pNUMA nodes -- with the toolstack implementing all of this
> just using the soft affinities.

I agree with everything. About the interactions the evolution of this series will have with vNUMA, I already talked with Elena in Edinburgh, and I'll be taking care of those.
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:08 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 16:56 +0000, George Dunlap wrote:
> [Coming back after going through the whole series]
>
> This is basically the main architectural question that needs to be
> sorted out with the series: Do we bake in that the "soft affinity" is
> specifically for NUMA-ness, or not?

Damn... I was really hoping to be almost done with this series! Unfortunately, this point you're raising is a very good one and, even more unfortunately, I like the way in which you propose to solve it, so I've got to re-implement this thing, or --I'm sure of that-- I won't be able to get any sleep at night for months! :-P :-P :-P

> The patch the way it is does make this connection, and that has several
> implications:
> * There is no more concept of a separate "domain numa affinity" (Patch
> 06); the domain numa affinity is just a pre-calculated union of the vcpu
> affinities.
> * The interface to this "soft affinity" is a bitmask of numa nodes, not
> a bitmask of cpus.
>
> If we're OK with that direction, then I think this patch series looks
> pretty good.

We're not... As said, now that you put it this way, it really sounds so wrong that I have to fix it! :-)

> Release-wise, I think as long as we're OK with libxl providing a
> "set_vcpu_numa_affinity", then we can always come back and change the
> implementation later if we want to maintain that distinction internally.

That's true too. Let's see how far and how quickly I can get to implementing what you propose, and then decide.

BTW, if I can ask (and sorry if it's my fault for having missed it), what's the updated release schedule?
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-05 23:12 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:30 +0000, Jan Beulich wrote:
> >>> On 05.11.13 at 18:16, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > This would mean the toolstack either assuming, when someone calls
> > vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> > its own copy of numa_affinity for each vcpu around somewhere.
>
> Question is how reasonable it is for the tool stack to keep such
> information around and up-to-date, in particular considering
> node/CPU/memory hotplug.

Right. Well, I may be wrong, and I won't know until I actually try coding it (first thing in the morning), but I honestly don't think there will be much to keep around at the toolstack level.

> If that's reasonable, then your
> proposal below seems to be the right one architecturally, i.e.
> keeping policy entirely at the tools level.

Agreed.

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 08:48 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On mar, 2013-11-05 at 14:50 +0000, George Dunlap wrote:
> On 11/05/2013 02:34 PM, Dario Faggioli wrote:
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
> >    but not pCPU 6;
> >  * ...
> >
> > In both domain config file and `xl vcpu-pin'.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> Overall looks like a pretty clean patch; just a few comments.

Ok.

> > diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> >
> > -To allow all the vcpus of the guest to run on cpus 0,2,3,5.
> > +To allow all the vcpus of the guest to run on cpus 0,2,3,5. Combining
> > +this with "all" is possible, meaning "all,^7" results in all the vcpus
> > +of the guest running on all the cpus on the host except cpu 7.
> > +
> > +=item "nodes:0-3,node:^2"
>
> Here you use both "nodes" and "node", while the code seems to only check
> for "nodes".

Right... This was a leftover from a previous version where I was actually checking for both, as well as trying to enforce singular/plural consistency (which, I agree with you, is overkill).

> I was originally going to say we should just check one;
> but on the other hand, it's just an extra string compare -- I feel like
> we might as well accept either "node" or "nodes". (No need to enforce
> plurality: "nodes:2" and "node:1-3" should both be fine with me.)
>

Right, I'll add the support for both then.

> > +static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> >  {
> > -    libxl_bitmap exclude_cpumap;
> > -    uint32_t cpuida, cpuidb;
> > -    char *endptr, *toka, *tokb, *saveptr = NULL;
> > -    int i, rc = 0, rmcpu;
> > +    unsigned long ida, idb;
> > +    libxl_bitmap node_cpumap;
> > +    bool is_not = false, is_nodes = false;
> > +    int rc = 0;
> > +
> > +    libxl_bitmap_init(&node_cpumap);
> >
> > -    if (!strcmp(cpu, "all")) {
> > +    rc = libxl_node_bitmap_alloc(ctx, &node_cpumap, 0);
> > +    if (rc) {
> > +        fprintf(stderr, "libxl_node_bitmap_alloc failed.\n");
> > +        goto out;
> > +    }
> > +
> > +    /* Are we adding or removing cpus/nodes? */
> > +    if (STR_SKIP_PREFIX(str, "^")) {
> > +        is_not = true;
> > +    }
> > +
> > +    /* Are we dealing with cpus or full nodes? */
> > +    if (STR_SKIP_PREFIX(str, "nodes:")) {
> > +        is_nodes = true;
> > +    }
> > +
> > +    if (STR_HAS_PREFIX(str, "all")) {
>
> Is there any reason not to keep this "strcmp"? As it is, this will
> accept any string that *starts* with "all", which isn't exactly what you
> want, I don't think.

Ah, true... I'll revert this to strcmp().

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 09:15 UTC
Re: [PATCH RESEND 07/12] xen: numa-sched: use per-vcpu node-affinity for actual scheduling
On mar, 2013-11-05 at 16:20 +0000, George Dunlap wrote:
> On 11/05/2013 02:35 PM, Dario Faggioli wrote:
> > @@ -308,11 +279,9 @@ static void csched_set_node_affinity(
> >  static inline int __vcpu_has_node_affinity(const struct vcpu *vc,
> >                                             const cpumask_t *mask)
> >  {
> > -    const struct domain *d = vc->domain;
> > -    const struct csched_dom *sdom = CSCHED_DOM(d);
> > -
> > -    if (cpumask_full(sdom->node_affinity_cpumask)
> > -        || !cpumask_intersects(sdom->node_affinity_cpumask, mask) )
> > +    if ( vc->auto_node_affinity == 1
> > +         || cpumask_full(vc->node_affinity)
> > +         || !cpumask_intersects(vc->node_affinity, mask) )
>
> Isn't cpumask_full(vc->node_affinity) <=> vc->auto_node_affinity at this
> point?

It is. I'd be killing the cpumask_full() part, if it wasn't for the fact that we decided to re-architect the whole thing, and I'm doing the latter right now.

> Other than that, I think this whole patch looks good -- nice when you
> can add a feature and simplify the code. :-)

Nice indeed! :-)

Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 09:39 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 17:16 +0000, George Dunlap wrote:
> Just to outline what the alternative would look like: The hypervisor
> would focus on the minimum mechanisms required to do something useful
> for NUMA systems. The domain NUMA affinity would be only used for
> memory allocation. vcpus would only have "hard" and "soft" affinities.
> The toolstack (libxl? xl?) would be responsible for stitching these
> together into a usable interface for NUMA: e.g., it would have the
> concept of "numa affinity" for vcpus (or indeed, virtual NUMA
> topologies), and would do things like update the domain NUMA affinity
> based on vcpu affinities.
>
> This would mean the toolstack either assuming, when someone calls
> vcpu_set_node_affinity, that soft_affinity == numa_affinity, or keeping
> its own copy of numa_affinity for each vcpu around somewhere.

And to elaborate a bit more on what I said yesterday night, now that I
have the code in front of me, going for the above would actually mean
the following.

In domain.c we have domain_update_node_affinity(). What it does *before*
this series is calculating d->node_affinity based on all the vcpus'
cpu_affinity (i.e., pinning). What it does *after* this series is
calculating d->node_affinity based on the vcpus' node_affinity. (*)

Such a function is currently called, basically, when a new vcpu is
allocated (alloc_vcpu()), when a domain changes cpupool
(sched_move_domain()), and when the cpupool the domain is in changes
(cpupool_assign_cpu_locked() or cpupool_unassign_cpu()). That means that
all the above operations _automatically_ affect d->node_affinity.

Now, we're talking about killing vc->cpu_affinity and not introducing
vc->node_affinity and, instead, introducing vc->cpu_hard_affinity and
vc->cpu_soft_affinity and, more importantly, not linking any of the
above to d->node_affinity.
That means all the above operations _will_NOT_ automatically affect
d->node_affinity any longer, at least from the hypervisor (and, most
likely, libxc) perspective. OTOH, I'm almost sure that I can force libxl
(and xl) to retain the exact same behaviour it is exposing to the user
(just by adding an extra call when needed).

So, although all this won't be an issue for xl and libxl consumers (or,
at least, that's my goal), it will change how the hypervisor used to
behave in all those situations. This means that xl and libxl users will
see no change, while folks issuing hypercalls and/or libxc calls will.

Is that ok? I mean, I know there are no stability concerns for those
APIs, but still, is that an acceptable change?

Regards,
Dario

(*) yes, in both cases (before and after this series), it is possible
already that d->node_affinity is not automatically calculated, and that
it just sticks to something the toolstack provided. That will stay, so
it's pretty much irrelevant to this discussion... Actually, it won't
just "stay", it will become the sole and only case!
Jan Beulich
2013-Nov-06 09:46 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Now, we're talking about killing vc->cpu_affinity and not introducing
> vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
> vc->cpu_soft_affinity and, more important, not to link any of the above
> to d->node_affinity. That means all the above operations _will_NOT_
> automatically affect d->node_affinity any longer, at least from the
> hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
> that I can force libxl (and xl) to retain the exact same behaviour it is
> exposing to the user (just by adding an extra call when needed).
>
> So, although all this won't be an issue for xl and libxl consumers (or,
> at least, that's my goal), it will change how the hypervisor used to
> behave in all those situations. This means that xl and libxl users will
> see no change, while folks issuing hypercalls and/or libxc calls will.
>
> Is that ok? I mean, I know there are no stability concerns for those
> APIs, but still, is that an acceptable change?

I would think that as long as d->node_affinity is empty, it should still
be set based on all vCPUs' hard affinities (as it is an error - possibly
to be interpreted as "do this for me" - to try to set an empty node
affinity, there's no conflict here). Or alternatively a flag could be
set once it got set, preventing further implicit updates.

Jan
Dario Faggioli
2013-Nov-06 10:00 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 09:46 +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > Now, we're talking about killing vc->cpu_affinity and not introducing
> > vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
> > vc->cpu_soft_affinity and, more important, not to link any of the above
> > to d->node_affinity. That means all the above operations _will_NOT_
> > automatically affect d->node_affinity any longer, at least from the
> > hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
> > that I can force libxl (and xl) to retain the exact same behaviour it is
> > exposing to the user (just by adding an extra call when needed).
> >
> > So, although all this won't be an issue for xl and libxl consumers (or,
> > at least, that's my goal), it will change how the hypervisor used to
> > behave in all those situations. This means that xl and libxl users will
> > see no change, while folks issuing hypercalls and/or libxc calls will.
> >
> > Is that ok? I mean, I know there are no stability concerns for those
> > APIs, but still, is that an acceptable change?
>
> I would think that as long as d->node_affinity is empty, it should
> still be set based on all vCPUs' hard affinities

I see, and that sounds sensible to me... It's mostly a matter of
deciding whether or not we want something like that and, if yes, whether
we want it based on hard or soft.

Personally, I think I agree with you on having it based on hard
affinities by default.

Let's see if George gets to say something before I get to that part of
the (re)implementation. :-)

> (as it is an error - possibly to be interpreted as "do this for me" -
> to try to set an empty node affinity, there's no conflict here). Or
> alternatively a flag could be set once it got set, preventing further
> implicit updates.

Sure. It's actually quite similar to that already.
Both in the tree and in the series, affinity to none is just an error,
affinity to all means "do it for me", and I do have the flag there. I
can easily change this into what you're suggesting (i.e., make none =>
"do it for me").

Regards,
Dario
Dario Faggioli
2013-Nov-06 11:41 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mar, 2013-11-05 at 15:11 +0000, George Dunlap wrote:
> How about, "cpu_node_affinity"?
>
> Or, we could internally change the names to "cpu_hard_affinity" and
> "cpu_soft_affinity", since that's effectively what the scheduler will
> do.

And about naming, what about introducing 'cpu_soft_affinity' and
sticking to just 'cpu_affinity' for the hard one? That way we could
leave XEN_DOMCTL_{set,get}vcpuaffinity alone and introduce
XEN_DOMCTL_{set,get}vcpusoftaffinity. It's not super pretty, but it
looks better than anything else I can think of.

In fact, I can of course rename 'cpu_affinity' to 'cpu_hard_affinity'
too, but that would mean having the following:

    case XEN_DOMCTL_setvcpuaffinity:
    case XEN_DOMCTL_getvcpuaffinity:
        ...
        if ( op->cmd == XEN_DOMCTL_setvcpuaffinity )
        {
            xenctl_bitmap_to_cpumask(&new_affinity,
                                     &op->u.vcpuaffinity.cpumap);
            vcpu_set_hard_affinity(v, new_affinity);
        }
        else
            cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.cpumap,
                                     v->cpu_hard_affinity);
        ...

    case XEN_DOMCTL_setvcpusoftaffinity:
    case XEN_DOMCTL_getvcpusoftaffinity:
        ...
        if ( op->cmd == XEN_DOMCTL_setvcpusoftaffinity )
        {
            xenctl_bitmap_to_cpumask(&new_affinity,
                                     &op->u.vcpuaffinity.cpumap);
            vcpu_set_soft_affinity(v, new_affinity);
        }
        else
            cpumask_to_xenctl_bitmap(&op->u.vcpuaffinity.cpumap,
                                     v->cpu_soft_affinity);
        ...

I.e., *vcpusoftaffinity manipulates cpu_soft_affinity while
*vcpuaffinity manipulates cpu_hard_affinity, which looks asymmetrical,
in addition to not being pretty. And, of course, the same will also
happen at the higher (libxc and libxl) levels.

There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in
it!)... But is that really an option? And even if it is an option for
the hypervisor and, perhaps, libxc, I'm quite sure it's not for
libxl... Am I wrong?
Thanks and Regards,
Dario
George Dunlap
2013-Nov-06 11:44 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 10:00, Dario Faggioli wrote:
> On mer, 2013-11-06 at 09:46 +0000, Jan Beulich wrote:
>>>>> On 06.11.13 at 10:39, Dario Faggioli <dario.faggioli@citrix.com> wrote:
>>> Now, we're talking about killing vc->cpu_affinity and not introducing
>>> vc->node_affinity and, instead, introduce vc->cpu_hard_affinity and
>>> vc->cpu_soft_affinity and, more important, not to link any of the above
>>> to d->node_affinity. That means all the above operations _will_NOT_
>>> automatically affect d->node_affinity any longer, at least from the
>>> hypervisor (and, most likely, libxc) perspective. OTOH, I'm almost sure
>>> that I can force libxl (and xl) to retain the exact same behaviour it is
>>> exposing to the user (just by adding an extra call when needed).
>>>
>>> So, although all this won't be an issue for xl and libxl consumers (or,
>>> at least, that's my goal), it will change how the hypervisor used to
>>> behave in all those situations. This means that xl and libxl users will
>>> see no change, while folks issuing hypercalls and/or libxc calls will.
>>>
>>> Is that ok? I mean, I know there are no stability concerns for those
>>> APIs, but still, is that an acceptable change?
>> I would think that as long as d->node_affinity is empty, it should
>> still be set based on all vCPUs' hard affinities
>>
> I see, and that sounds sensible to me... It's mostly a matter of
> deciding whether or not we want something like that and, if yes,
> whether we want it based on hard or soft.
>
> Personally, I think I agree with you on having it based on hard
> affinities by default.
>
> Let's see if George gets to say something before I get to that part of
> the (re)implementation. :-)

I would probably have it based on soft affinities, since that's where we
expect to have the domain's vcpus actually running most of the time; but
it's really a bike-shed issue, and something we can change / adjust in
the future.
(Although I suppose ideal behavior would be for the allocator to have
three levels of preference instead of just two: allocate from soft
affinity first; if that's not available, allocate from hard affinity;
and finally allocate wherever you can find ram. But that's probably more
work than it's worth at this point.)

So what's the plan now Dario? You're going to re-spin the patch series
to just do hard and soft affinities at the HV level, plumbing the
results through the toolstack?

I think for now I might advise putting off doing a NUMA interface at the
libxl level, and do a full vNUMA interface in another series (perhaps
for 4.5, depending on the timing).

 -George
Dario Faggioli
2013-Nov-06 14:26 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
> On 06/11/13 10:00, Dario Faggioli wrote:
> > I see, and that sounds sensible to me... It's mostly a matter of
> > deciding whether or not we want something like that and, if yes,
> > whether we want it based on hard or soft.
> >
> > Personally, I think I agree with you on having it based on hard
> > affinities by default.
> >
> > Let's see if George gets to say something before I get to that part
> > of the (re)implementation. :-)
>
> I would probably have it based on soft affinities, since that's where
> we expect to have the domain's vcpus actually running most of the time;

True. However, doing that would rule out cpupools and vcpu-pinning.
That is to say, if you create a domain in a pool or with its vcpus
pinned (by specifying "pool=" or "cpus=" in the config file), it'll get
its memory striped over all the nodes. In fact, in both these cases,
there really is no soft affinity involved.

This is bad, because people may already be used to creating a domain
with "cpus=", and having the memory allocated from the relevant nodes.
OTOH, there is no similar thing (i.e., a user interface/API) for soft
affinities, and the way I was using it at the xl and libxl level for the
sake of NUMA performance, I can easily implement there on top of soft
affinities.

So, in summary, I'd have liked to have it based on soft affinity too,
but I think that would break more things than having it based on hard
ones.

> but it's really a bike-shed issue, and something we can change /
> adjust in the future.

That is also true, indeed.

> (Although I suppose ideal behavior would be for the allocator to have
> three levels of preference instead of just two: allocate from soft
> affinity first; if that's not available, allocate from hard affinity;
> and finally allocate wherever you can find ram. But that's probably
> more work than it's worth at this point.)
I like this too, but that's definitely something longer term than this
or next week.

> So what's the plan now Dario? You're going to re-spin the patch series
> to just do hard and soft affinities at the HV level, plumbing the
> results through the toolstack?

Yes, that was what I had in mind, and what I already started working on
(see the other mail I sent before lunch about (re)naming). I should be
able to craft and send something by either today or tomorrow.

> I think for now I might advise putting off doing a NUMA interface at
> the libxl level, and do a full vNUMA interface in another series
> (perhaps for 4.5, depending on the timing).

Well, I agree that all this is of very little use without vNUMA but, at
the same time, it's not necessarily only useful for it. Also, whether it
is vcpu-node-affinity or soft-affinity, if it is not wired up properly
to the higher layers, there's very little point in having the HV part
only... So my idea was to redo and resend everything, including libxl
and xl bits.

Of course, that doesn't mean we must necessarily have this for 4.4
(although I think it would still be feasible), just that we either check
in or wait for both the implementation and the interface. Again, how's
the updated release schedule?

Thanks and Regards,
Dario
George Dunlap
2013-Nov-06 14:47 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 11:41, Dario Faggioli wrote:
> There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
> XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in
> it!)... But is that really an option? And even if it is an option for
> the hypervisor and, perhaps, libxc, I'm quite sure it's not for
> libxl... Am I wrong?

I think vcpuaffinity is a domctl, which means it's an interface to
libxc, and thus malleable. So I *think* we could actually just add a
parameter to vcpuaffinity to say "hard" or "soft". (Correct me if I'm
wrong, Jan.)

Where we need to consider backwards compatibility is in the libxl
interface. I'm not sure what the best thing to do there is. I think we
can't break the existing API, so we need to keep
libxl_set_vcpuaffinity() available for old callers. Options include:

 * Just add libxl_set_vcpuaffinity_soft(), and leave the first one as
   it is
 * Add libxl_set_vcpu_affinity_[something]() which takes a soft/hard
   flag, and deprecate the existing one
 * Redefine libxl_set_vcpuaffinity() to include a soft/hard flag,
   bumping the API version and having a work-around for old callers.

IanJ / IanC, any opinions?

 -George
George Dunlap
2013-Nov-06 14:56 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 14:26, Dario Faggioli wrote:
> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>> On 06/11/13 10:00, Dario Faggioli wrote:
>>> I see, and that sounds sensible to me... It's mostly a matter of
>>> deciding whether or not we want something like that and, if yes,
>>> whether we want it based on hard or soft.
>>>
>>> Personally, I think I agree with you on having it based on hard
>>> affinities by default.
>>>
>>> Let's see if George gets to say something before I get to that part
>>> of the (re)implementation. :-)
>> I would probably have it based on soft affinities, since that's where
>> we expect to have the domain's vcpus actually running most of the
>> time;
>>
> True. However, doing that would rule out cpupools and vcpu-pinning.

I guess I was assuming that a vcpu's soft affinity would always be
considered a subset of its hard affinity and the cpus in its cpupool.

>> I think for now I might advise putting off doing a NUMA interface at
>> the libxl level, and do a full vNUMA interface in another series
>> (perhaps for 4.5, depending on the timing).
>>
> Well, I agree that all this is of very little use without vNUMA but,
> at the same time, it's not necessarily only useful for it. Also,
> whether it is vcpu-node-affinity or soft-affinity, if it is not wired
> up properly to the higher layers, there's very little point in having
> the HV part only... So my idea was to redo and resend everything,
> including libxl and xl bits.
>
> Of course, that doesn't mean we must necessarily have this for 4.4
> (although I think it would still be feasible), just that we either
> check in or wait for both the implementation and the interface. Again,
> how's the updated release schedule?

Well we definitely should plumb through access to the soft affinity
directly through libxl. The question is whether it would be useful to
have a libxl per-vcpu *numa* affinity that is not yet a full vNUMA
implementation.
If long-term we want to have vcpu -> vnode, and then vnode -> pnode,
then having a direct vcpu -> pnode interface in the interim is probably
not particularly useful, I don't think.

Re the release schedule: we agreed to move the code freeze to November
18, so we've got just under two weeks. I think keeping the libxl side
simple will allow us to at least get the "soft affinity" stuff in for
4.4, whether the vNUMA stuff makes it or not.

 -George
Jan Beulich
2013-Nov-06 15:14 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> On 06/11/13 14:26, Dario Faggioli wrote:
>> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>>> On 06/11/13 10:00, Dario Faggioli wrote:
>>>> I see, and that sounds sensible to me... It's mostly a matter of
>>>> deciding whether or not we want something like that and, if yes,
>>>> whether we want it based on hard or soft.
>>>>
>>>> Personally, I think I agree with you on having it based on hard
>>>> affinities by default.
>>>>
>>>> Let's see if George gets to say something before I get to that part
>>>> of the (re)implementation. :-)
>>> I would probably have it based on soft affinities, since that's where
>>> we expect to have the domain's vcpus actually running most of the
>>> time;
>>>
>> True. However, doing that would rule out cpupools and vcpu-pinning.
>
> I guess I was assuming that a vcpu's soft affinity would always be
> considered a subset of its hard affinity and the cpus in its cpupool.

For CPU pools I agree, but didn't we mean hard affinity to control
memory allocation, and soft affinity scheduling decisions (in which case
there's no strict ordering between the two)? If not, I guess I'd need a
brief but clear definition of what "soft" and "hard" are supposed to
represent...

Jan
George Dunlap
2013-Nov-06 16:12 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On 06/11/13 15:14, Jan Beulich wrote:
>>>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
>> On 06/11/13 14:26, Dario Faggioli wrote:
>>> On mer, 2013-11-06 at 11:44 +0000, George Dunlap wrote:
>>>> On 06/11/13 10:00, Dario Faggioli wrote:
>>>>> I see, and that sounds sensible to me... It's mostly a matter of
>>>>> deciding whether or not we want something like that and, if yes,
>>>>> whether we want it based on hard or soft.
>>>>>
>>>>> Personally, I think I agree with you on having it based on hard
>>>>> affinities by default.
>>>>>
>>>>> Let's see if George gets to say something before I get to that
>>>>> part of the (re)implementation. :-)
>>>> I would probably have it based on soft affinities, since that's
>>>> where we expect to have the domain's vcpus actually running most of
>>>> the time;
>>>>
>>> True. However, doing that would rule out cpupools and vcpu-pinning.
>> I guess I was assuming that a vcpu's soft affinity would always be
>> considered a subset of its hard affinity and the cpus in its cpupool.
> For CPU pools I agree, but didn't we mean hard affinity to control
> memory allocation, and soft affinity scheduling decisions (in which
> case there's no strict ordering between the two)? If not, I guess I'd
> need a brief but clear definition of what "soft" and "hard" are
> supposed to represent...

We have two "affinities" at the moment:

* per-vcpu cpu affinity. This is a scheduling construct, and is used to
  restrict vcpus to run on specific pcpus. This is what we would call
  hard affinity -- "hard" because vcpus are *not allowed* to run on cpus
  not in this mask.

* Domain NUMA affinity. As of 4.3 this has two functions: both a memory
  allocation function and a scheduling function. For the scheduling
  function, it acts as a "soft affinity", but it's domain-wide, rather
  than being per-vcpu.
It's "soft" because the scheduler will *try* to run it on pcpus in that
mask, but if it cannot, it *will allow* them to run on pcpus not in that
mask. At the moment you can set this manually, or if it's not set, then
Xen will set it automatically based on the vcpu hard affinities (and, I
think, the cpupool that it's in).

What we're proposing is to remove the domain-wide soft affinity and
replace it with a per-vcpu soft affinity. That way each vcpu has some
pcpus that it prefers (in the soft affinity *and* the hard affinity),
some that it doesn't prefer but is OK with (in the hard affinity but not
the soft affinity), and some that it cannot run on at all (not in the
hard affinity).

The question Dario has is this: given that we now have per-vcpu hard and
soft scheduling affinity, how should we automatically construct the
per-domain memory allocation affinity, if at all? Should we construct it
from the "hard" scheduling affinities, or from the "soft" scheduling
affinities?

I said that I thought we should use the soft affinity; but I really
meant the "effective soft affinity" -- i.e., the union of soft, hard,
and cpupools.

 -George
Dario Faggioli
2013-Nov-06 16:20 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 15:14 +0000, Jan Beulich wrote:
> >>> On 06.11.13 at 15:56, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> > I guess I was assuming that a vcpu's soft affinity would always be
> > considered a subset of its hard affinity and the cpus in its cpupool.

Well, we're doing all this talking with the specific goal of
disentangling soft from hard from memory affinities, aren't we?

However, yes, for both logical and implementation reasons, a domain that
lives in a pool should not have either soft or hard affinity to any pcpu
which is outside of the pool. For soft affinity being a subset of hard
affinity, not so much. Meaning that it is possible (it was possible in
this series, and I think it should remain possible in its new
incarnation) to have completely disjoint sets of pcpus as hard and soft
affinities, at least at the hypervisor level. Then, of course, higher in
the toolstack we can come up with proper policing, but that's a
different issue.

Or are you (George) saying that if domain A has soft affinity to pcpus
0-15, setting its hard affinity to 0-7 should affect the soft affinity
(and constrain it within that range) too? I'm almost sure we had a
similar discussion in another thread, and it was you that argued that we
should not introduce these kinds of dependencies, where the user changes
parameter X and this silently results in Y changing too...

> For CPU pools I agree, but didn't we mean hard affinity to control
> memory allocation, and soft affinity scheduling decisions (in which
> case there's no strict ordering between the two)? If not, I guess I'd
> need a brief but clear definition of what "soft" and "hard" are
> supposed to represent...

Well, let me try.
Both hard and soft affinity are all about scheduling, with the following
meaning:

 - hard: the *allowed* pcpus
 - soft: the *preferred* pcpus

If there are no free *preferred* pcpus, as well as if the soft affinity
mask is empty or has an empty intersection with the hard one, a pcpu
from the *allowed* ones is picked up by the scheduler.

Memory allocation is controlled by (let's call it that) node affinity.
There will be specific (hyper)calls to set it explicitly, but the point
here is, if none of those is invoked, how should it look?

My opinion is that, doing as you said, having it based on hard
affinities makes things easier and quicker to implement, and guarantees
the most consistent behaviour wrt the current one. Thus, although I know
we don't care much about backward compatibility at this level, I'd go
for it.

Thanks and Regards,
Dario
Jan Beulich
2013-Nov-06 16:22 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
>>> On 06.11.13 at 17:12, George Dunlap <george.dunlap@eu.citrix.com> wrote: > The question Dario has is this: given that we now have per-vcpu hard and > soft scheduling affinity, how should we automatically construct the > per-domain memory allocation affinity, if at all? Should we construct > it from the "hard" scheduling affinities, or from the "soft" scheduling > affinities? > > I said that I thought we should use the soft affinity; but I really > meant the "effective soft affinity" -- i.e., the union of soft, hard, > and cpupools.Actually I think trying on the most narrow set first (soft) and then widening (hard, anywhere) is indeed the most sensible approach then. Jan
Dario Faggioli
2013-Nov-06 16:23 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 14:56 +0000, George Dunlap wrote:
> On 06/11/13 14:26, Dario Faggioli wrote:
> > Well, I agree that all this is of very little use without vNUMA but,
> > at the same time, it's not necessarily only useful for it. Also,
> > whether it is vcpu-node-affinity or soft-affinity, if it is not
> > wired up properly to the higher layers, there's very little point in
> > having the HV part only... So my idea was to redo and resend
> > everything, including libxl and xl bits.
> >
> > Of course, that doesn't mean we must necessarily have this for 4.4
> > (although I think it would still be feasible), just that we either
> > check in or wait for both the implementation and the interface.
> > Again, how's the updated release schedule?
>
> Well we definitely should plumb through access to the soft affinity
> directly through libxl. The question is whether it would be useful to
> have a libxl per-vcpu *numa* affinity that is not yet a full vNUMA
> implementation.

Ah, ok, I didn't read your comment this way, sorry. :-) Now that you
explained it, yes, I think we're on the same page, and I'm more than
happy to leave the libx?_{set,get}vcpu_node_affinity() stuff out of the
new series.

> If long-term we want to have vcpu -> vnode, and then vnode -> pnode,
> then having a direct vcpu -> pnode interface in the interim is
> probably not particularly useful, I don't think.

Indeed.

> Re the release schedule: we agreed to move the code freeze to November
> 18, so we've got just under two weeks.

Ok, that was my recollection from the developer meeting, but I just
wanted to be sure.

> I think keeping the libxl side simple will allow us to at least get
> the "soft affinity" stuff in for 4.4, whether the vNUMA stuff makes it
> or not.

Agreed.
Thanks and Regards,
Dario
Dario Faggioli
2013-Nov-06 16:48 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 16:12 +0000, George Dunlap wrote:
> We have two "affinities" at the moment:
> * per-vcpu cpu affinity. This is a scheduling construct, and is used
> to restrict vcpus to run on specific pcpus. This is what we would call
> hard affinity -- "hard" because vcpus are *not allowed* to run on cpus
> not in this mask.
> * Domain NUMA affinity. As of 4.3 this has two functions: both a
> memory allocation function and a scheduling function. For the
> scheduling function, it acts as a "soft affinity", but it's
> domain-wide, rather than being per-vcpu. It's "soft" because the
> scheduler will *try* to run it on pcpus in that mask, but if it
> cannot, it *will allow* them to run on pcpus not in that mask.
>
> At the moment you can set this manually, or if it's not set, then Xen
> will set it automatically based on the vcpu hard affinities (and I
> think the cpupool that it's in).
>
> What we're proposing is to remove the domain-wide soft affinity and
> replace it with a per-vcpu soft affinity. That way each vcpu has some
> pcpus that it prefers (in the soft affinity *and* the hard affinity),
> some that it doesn't prefer but is OK with (hard affinity but not soft
> affinity), and some that it cannot run on at all (not in the hard
> affinity).

Great explanation... A bit longer but *much* better than mine. :-)

> The question Dario has is this: given that we now have per-vcpu hard
> and soft scheduling affinity, how should we automatically construct
> the per-domain memory allocation affinity, if at all? Should we
> construct it from the "hard" scheduling affinities, or from the "soft"
> scheduling affinities?

Exactly.

> I said that I thought we should use the soft affinity; but I really
> meant the "effective soft affinity" -- i.e., the union of soft, hard,
> and cpupools.

Aha... I see it now... Only one more thing: union or intersection?
Union would be nice, especially because there are very few risks of it being empty, but it may be too broad (if one of the tree is ''all cpus/nodes'', so will the memory affinity). I''d go for the intersection, with some extra care to avoid degeneration into the empty set. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://about.me/dario.faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2013-Nov-06 16:53 UTC
Re: [PATCH RESEND 05/12] xen: numa-sched: make space for per-vcpu node-affinity
On mer, 2013-11-06 at 14:47 +0000, George Dunlap wrote:
> On 06/11/13 11:41, Dario Faggioli wrote:
> > There's of course the option of renaming XEN_DOMCTL_setvcpuaffinity to
> > XEN_DOMCTL_setvcpuhardaffinity (and, while there, put some '_' in it!)...
> > But is that really an option? And even if it is an option for the
> > hypervisor, and perhaps, libxc, I'm quite sure it's not for libxl... Am
> > I wrong?
>
> I think vcpuaffinity is a domctl, which means it's an interface to libxc,
> and thus malleable. So I *think* we could actually just add a parameter
> to vcpuaffinity to say "hard" or "soft". (Correct me if I'm wrong, Jan.)
>
Ok, that certainly makes it a lot easier than I thought. I'll take it that this also holds for xc_vcpu_setaffinity(), and will add a flag there too.

> Where we need to consider backwards compatibility is in the libxl
> interface.
>
Sure.

> I'm not sure what the best thing to do there is. I think we
> can't break the existing API, so we need to keep
> libxl_set_vcpuaffinity() available for old callers. Options include:
> * Just add libxl_set_vcpuaffinity_soft(), and leave the first one as it is
>
All three of these look sensible to me. Personally, I think I like this first one the most. Anyway...

> * Add libxl_set_vcpu_affinity_[something]() which takes a soft/hard
> flag, and deprecate the existing one
> * Redefine libxl_set_vcpuaffinity() to include a soft/hard flag, bumping
> the API version and having a work-around for old callers.
>
> IanJ / IanC, any opinions?
>
... Sure, let's hear the maintainers! :-)

Thanks and Regards, Dario
Ian Jackson
2013-Nov-07 18:17 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Dario Faggioli writes ("[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> Making it possible to use something like the following:
>  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
>  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
>  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
>    but not pCPU 6;
>  * ...

Thanks. This parsing is a lot clearer now.

> @@ -59,6 +59,11 @@
>  }                                                                 \
>  })
>
> +#define STR_HAS_PREFIX( a, b ) \
> +    ( strncmp(a, b, strlen(b)) == 0 )
> +#define STR_SKIP_PREFIX( a, b ) \
> +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )

I think it might be worth making the type of STR_SKIP_PREFIX explicitly boolean. Eg,

  + ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )

Since the returned pointer value isn't very helpful.

> -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> +{
> +    char *nstr, *endptr;

Missing consts?

> +    if (STR_HAS_PREFIX(str, "all")) {
>          libxl_bitmap_set_any(cpumap);
> -        return 0;
> +        goto out;

I think this does the wrong thing with "^all".

> +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> +         ptr = strtok_r(NULL, ",", &saveptr)) {

A minor style complaint: if you are going to split two of these three items onto their own line, please give them all their own line.

Thanks, Ian.
Ian Jackson
2013-Nov-07 18:27 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by providing the proper get/set interfaces and wiring them
> to the new domctl-s from the previous commit.
...
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);

This looks good to me, but I would like Ian C to check that the use of the hypercall buffer machinery is correct.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.
Ian Jackson
2013-Nov-07 18:29 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by providing the proper get/set interfaces and wiring them
> to the new libxc calls from the previous commit.
>
> For the 'get' part, the node-affinity of all the vcpus of a
> domain is also reported via libxl_list_vcpu() (exactly as it
> is happening for their vcpu affinity already), adding a
> specific field to libxl_vcpuinfo.

This is a change to the libxl ABI. I think you should bump the MAJOR number in this patch (or a previous one). Otherwise it looks good.

Ian.
Ian Jackson
2013-Nov-07 18:33 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> by showing it, upon request ('-n' switch), as a part of the
> output of `xl vcpu-list'. Such output looks like this:
...
> # xl vcpu-list -n
>
> Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> Domain-0    0     0    8    -b-      4.4  any cpu / any node
> Domain-0    0     1    0    -b-      1.2  any cpu / any node
> vm-test     1     0    8    -b-      2.4  any cpu / 1
> vm-test     1     1   13    -b-      3.3  any cpu / 1
>
> The "vcpu-pinning / vcpu-affinity" part may indeed look
> weird, as it is different from the other fields.
> Unfortunately, it's very hard to properly format it in
> columns, since there is no such thing as a minimal or
> maximal length for the bitmaps being printed.

Can you use the same syntax for printing as would be legal in the config file?

> While at it, the implementation of `xl vcpu-list' is reworked
> a little bit, making it more similar to the one of `xl list'
> and more compliant with the libxl programming paradigm (e.g.,
> regarding allocating at the beginning and freeing on exit).

Can you do this in a pre-patch?

> xl manual page is updated accordingly.

> -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)

I don't understand how this set of changes relates to this patch.

Thanks, Ian.
Ian Jackson
2013-Nov-07 18:35 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Dario Faggioli writes ("[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> in a similar way to how it is possible to specify vcpu-affinity.
...
>  /*
> - * Check if the domain has any CPU affinity. If not, try to build
> - * up one. In case numa_place_domain() find at least a suitable
> - * candidate, it will affect info->nodemap accordingly; if it
> - * does not, it just leaves it as it is. This means (unless
> - * some weird error manifests) the subsequent call to
> - * libxl_domain_set_nodeaffinity() will do the actual placement,
> + * Check if the domain has any pinning or node-affinity and, if not, try
> + * to build up one.
> + *
> + * In case numa_place_domain() find at least a suitable candidate, it will
> + * affect info->nodemap accordingly; if it does not, it just leaves it as
> + * it is. This means (unless some weird error manifests) the subsequent
> + * call to libxl_domain_set_nodeaffinity() will do the actual placement,
>   * whatever that turns out to be.
>   */
>  if (libxl_defbool_val(info->numa_placement)) {
>
> -    if (!libxl_bitmap_is_full(&info->cpumap)) {
> +    if (!libxl_bitmap_is_full(&info->cpumap) ||
> +        !libxl_bitmap_is_full(&info->nodemap)) {
>          LOG(ERROR, "Can run NUMA placement only if no vcpu "
> -             "affinity is specified");
> +             "pinning or node-affinity is specified");

I appreciate I may be a bit late with this complaint, but surely this approach (setting the cpumap and nodemap when they aren't all-bits-set) means that it's not possible to explicitly set "all cpus". If I say in the config file "all" for cpus and "all" for nodes, this code will override that. I don't think that can be right.

Ian.
Dario Faggioli
2013-Nov-08 09:18 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
On gio, 2013-11-07 at 18:29 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > by providing the proper get/set interfaces and wiring them
> > to the new libxc calls from the previous commit.
> >
> > For the 'get' part, the node-affinity of all the vcpus of a
> > domain is also reported via libxl_list_vcpu() (exactly as it
> > is happening for their vcpu affinity already), adding a
> > specific field to libxl_vcpuinfo.
>
> This is a change to the libxl ABI. I think you should bump the MAJOR
> number in this patch (or a previous one).
>
Oh, do I? I may be missing something then. Every time I've done something similar to this before, I did provide the proper LIBXL_HAVE_* stuff (which this patch does), for the sake of _API_ compatibility, and that was it. :-(

I've just looked up in libxl.h and found the comment block about ABI compatibility and SONAME bumping but (sorry) I'm not 100% sure I fully understand what it says. Does it (and you) mean that, since this is the first change since 4.3 that breaks ABI compatibility, I should include a patch to tools/libxl/Makefile doing something like:

  -MAJOR=4.3
  +MAJOR=4.4

?

If yes, anything more than that?

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:24 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
On gio, 2013-11-07 at 18:17 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> > Making it possible to use something like the following:
> >  * "nodes:0-3": all pCPUs of nodes 0,1,2,3;
> >  * "nodes:0-3,^node:2": all pCPUs of nodes 0,1,3;
> >  * "1,nodes:1-2,^6": pCPU 1 plus all pCPUs of nodes 1,2
> >    but not pCPU 6;
> >  * ...
>
> Thanks. This parsing is a lot clearer now.
>
Good to hear that, again (yep, you said this before, in a previous review of a different series still including this patch :-D).

> > @@ -59,6 +59,11 @@
> >  }                                                                 \
> >  })
> >
> > +#define STR_HAS_PREFIX( a, b ) \
> > +    ( strncmp(a, b, strlen(b)) == 0 )
> > +#define STR_SKIP_PREFIX( a, b ) \
> > +    ( STR_HAS_PREFIX(a, b) ? (a) += strlen(b) : NULL )
>
> I think it might be worth making the type of STR_SKIP_PREFIX
> explicitly boolean. Eg,
>   + ( STR_HAS_PREFIX(a, b) ? ((a) += strlen(b), 1) : 0 )
>
> Since the returned pointer value isn't very helpful.
>
Good point. Will do.

> > -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> > +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> > +{
> > +    char *nstr, *endptr;
>
> Missing consts?
>
Is it? For instance, I'm using endptr like this:

  *a = *b = strtoul(str, &endptr, 10);

And the prototype of strtoul is:

  unsigned long int strtoul(const char *nptr, char **endptr, int base);

So it won't work, not for endptr at least. What I can do is the following:

  const char *nstr;
  char *endptr;

And I will, if you think it's better.

> > +    if (STR_HAS_PREFIX(str, "all")) {
> >          libxl_bitmap_set_any(cpumap);
> > -        return 0;
> > +        goto out;
>
> I think this does the wrong thing with "^all".
>
Good point again. I'll fix that.

> > +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> > +         ptr = strtok_r(NULL, ",", &saveptr)) {
>
> A minor style complaint: If you are going to split two of these three
> items onto their own line, please give them all their own line.
>
You mean ptr should have its own line? Like this?

  + for (ptr = strtok_r(cpu, ",", &saveptr);
  +      ptr;
  +      ptr = strtok_r(NULL, ",", &saveptr)) {

If yes, I sure can do that, although the result really looks super unpleasant to me (but that's a matter of taste, I guess).

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:33 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
On gio, 2013-11-07 at 18:33 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > by showing it, upon request ('-n' switch), as a part of the
> > output of `xl vcpu-list'. Such output looks like this:
> ...
> > # xl vcpu-list -n
> >
> > Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> > Domain-0    0     0    8    -b-      4.4  any cpu / any node
> > Domain-0    0     1    0    -b-      1.2  any cpu / any node
> > vm-test     1     0    8    -b-      2.4  any cpu / 1
> > vm-test     1     1   13    -b-      3.3  any cpu / 1
> >
> > The "vcpu-pinning / vcpu-affinity" part may indeed look
> > weird, as it is different from the other fields.
> > Unfortunately, it's very hard to properly format it in
> > columns, since there is no such thing as a minimal or
> > maximal length for the bitmaps being printed.
>
> Can you use the same syntax for printing as would be legal in
> the config file?
>
That is actually what is happening, for single cpu/node bitmaps. The problem arises when I need to print two bitmaps, one next to the other, like the cpumap for vCPU affinity and the nodemap for NUMA affinity above.

The thing is, I don't think there is a maximal length for the strings representing the bitmaps being printed, and thus no easy way to format the last two values (again, "CPU Affinity" and "NUMA Affinity") above in columns... Hence the "/". Ideas?

(BTW, the new version of the series I'm preparing will be very different, considering we decided with George and Jan to change the fundamental architecture of the whole thing. However, there still may be the need to do something like the above, independently of whether it will be for printing hard and soft affinity instead of CPU and NUMA affinity.)

> > While at it, the implementation of `xl vcpu-list' is reworked
> > a little bit, making it more similar to the one of `xl list'
> > and more compliant with the libxl programming paradigm (e.g.,
> > regarding allocating at the beginning and freeing on exit).
>
> Can you do this in a pre-patch?
>
Ok.

> > xl manual page is updated accordingly.
>
> > -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> > +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)
>
> I don't understand how this set of changes relates to this patch.
>
Well, I need to parse strings that may be very similar (the usual "1,2-7,^4" stuff), but with potentially different meanings for the various substrings (pcpus vs. nodes).

Anyway, in case I'm keeping this, or doing something similar to it for the new series, I'll try to figure out a way to make it simpler, or at minimum more readable, and clarify why it is needed (by breaking the patch down, etc.).

Thanks and Regards, Dario
Dario Faggioli
2013-Nov-08 09:49 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
On gio, 2013-11-07 at 18:35 +0000, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> > in a similar way to how it is possible to specify vcpu-affinity.
> ...
> >  /*
> > - * Check if the domain has any CPU affinity. If not, try to build
> > - * up one. In case numa_place_domain() find at least a suitable
> > - * candidate, it will affect info->nodemap accordingly; if it
> > - * does not, it just leaves it as it is. This means (unless
> > - * some weird error manifests) the subsequent call to
> > - * libxl_domain_set_nodeaffinity() will do the actual placement,
> > + * Check if the domain has any pinning or node-affinity and, if not, try
> > + * to build up one.
> > + *
> > + * In case numa_place_domain() find at least a suitable candidate, it will
> > + * affect info->nodemap accordingly; if it does not, it just leaves it as
> > + * it is. This means (unless some weird error manifests) the subsequent
> > + * call to libxl_domain_set_nodeaffinity() will do the actual placement,
> >   * whatever that turns out to be.
> >   */
> >  if (libxl_defbool_val(info->numa_placement)) {
> >
> > -    if (!libxl_bitmap_is_full(&info->cpumap)) {
> > +    if (!libxl_bitmap_is_full(&info->cpumap) ||
> > +        !libxl_bitmap_is_full(&info->nodemap)) {
> >          LOG(ERROR, "Can run NUMA placement only if no vcpu "
> > -             "affinity is specified");
> > +             "pinning or node-affinity is specified");
>
> I appreciate I may be a bit late with this complaint, but surely this
> approach (setting the cpumap and nodemap when they aren't
> all-bits-set) means that it's not possible to explicitly set "all
> cpus".
>
The funny part is not that you're late, but to whom you're saying it! :-D :-D

http://lists.xen.org/archives/html/xen-devel/2012-07/msg01077.html

  From: Ian Jackson <Ian.Jackson@xxxxxxxxxxxxx>
  Date: Mon, 23 Jul 2012 12:04:29 +0100

  Dario Faggioli writes ("[PATCH 2 of 4 v6/leftover] xl: inform libxl
  if there was a cpumap in the config file"):
  [...]
  Shouldn't this be
  -    if (libxl_bitmap_is_full(&info->cpumap)) {
  +    if (libxl_defbool_val(info->numa_placement)) {
  +        if (!libxl_bitmap_is_full(&info->cpumap)) {
  +            rc = ERROR_INVAL;
  +            LOG blah blah invalid not supported
  +            goto out;
  +        }

The story behind all this is that, back at that time, someone (and I'm sure it's you again, although I can't find the thread now) asked to introduce info->numa_placement, to have a means for:
 (1) disabling NUMA placement completely;
 (2) helping distinguish the case where cpumap is full because we want "cpus=all" from the one where it's full because we did not touch it, since being full is the default value.

I remember finding it sound, and thus going for it. Perhaps we could have done it differently, but given that info->cpumap is full by default, I really don't see that many alternatives (apart from making cpumap empty by default, which may look even weirder).

So, the idea here is:
 - if it's full, and placement is enabled, full means "do this for me", and that's what happens;
 - if placement is disabled, then nobody touches it, so if it was full it stays that way.

So, for saying "all", the procedure is as follows:

  info->numa_placement = false;
  fill_bitmap(info->cpumap);

> If I say in the config file "all" for cpus and "all" for nodes, this
> code will override that. I don't think that can be right.
>
And in fact xl, when parsing the config file, turns numa_placement to false when finding a "cpus=" item. All the above applies to the new "nodes=" switch as well, of course.

Thanks and Regards, Dario
Ian Jackson
2013-Nov-08 15:07 UTC
Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("Re: [PATCH RESEND 10/12] libxl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> On gio, 2013-11-07 at 18:29 +0000, Ian Jackson wrote:
> > This is a change to the libxl ABI. I think you should bump the MAJOR
> > number in this patch (or a previous one).
> >
> Oh, do I? I may be missing something then. Every time I've done
> something similar to this before, I did provide the proper LIBXL_HAVE_*
> stuff (which this patch does), for the sake of _API_ compatibility, and
> that was it. :-(

Yes. Probably in the past this bump had already been done for some other reason.

> I've just looked up in libxl.h and found the comment block about ABI
> compatibility and SONAME bumping but (sorry) I'm not 100% sure I fully
> understand what it says. Does it (and you) mean that, since this is the
> first change since 4.3 that breaks ABI compatibility, I should include a
> patch to tools/libxl/Makefile doing something like:
>
>  -MAJOR=4.3
>  +MAJOR=4.4
>
> ?

Exactly.

> If yes, anything more than that?

No, nothing more.

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:18 UTC
Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity
Dario Faggioli writes ("Re: [PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> On gio, 2013-11-07 at 18:33 +0000, Ian Jackson wrote:
> > Dario Faggioli writes ("[PATCH RESEND 11/12] xl: numa-sched: enable getting/specifying per-vcpu node-affinity"):
> > > # xl vcpu-list -n
> > >
> > > Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity / NUMA Affinity
> > > Domain-0    0     0    8    -b-      4.4  any cpu / any node
> > > Domain-0    0     1    0    -b-      1.2  any cpu / any node
> > > vm-test     1     0    8    -b-      2.4  any cpu / 1
> > > vm-test     1     1   13    -b-      3.3  any cpu / 1
...
> > Can you use the same syntax for printing as would be legal in
> > the config file?
> >
> That is actually what is happening, for single cpu/node bitmaps.

Err, I meant always, not just in one special case. Eg, "any cpu" should be "all". I don't know if you want to try to boil cpumap ranges down to node sets during printing - that's probably too hard.

> The problem arises when I need to print two bitmaps, one next to the
> other, like the cpumap for vCPU affinity and the nodemap for NUMA
> affinity above.

Separating them with a / is fine I think.

> (BTW, the new version of the series I'm preparing will be very
> different, considering we decided with George and Jan to change the
> fundamental architecture of the whole thing. However, there still may be
> the need to do something like the above, independently of whether it
> will be for printing hard and soft affinity instead of CPU and NUMA
> affinity.)

Right.

> > > xl manual page is updated accordingly.
> >
> > > -static int update_cpumap_range(const char *str, libxl_bitmap *cpumap)
> > > +static int update_bitmap_range(const char *str, libxl_bitmap *bitmap)
> >
> > I don't understand how this set of changes relates to this patch.
> >
> Well, I need to parse strings that may be very similar (the usual
> "1,2-7,^4" stuff), but with potentially different meanings for the
> various substrings (pcpus vs. nodes).

So perhaps there should be a pre-patch to change the names of these functions.

> Anyway, in case I'm keeping this, or doing something similar to it for
> the new series, I'll try to figure out a way to make it simpler, or
> at minimum more readable, and clarify why it is needed (by breaking the
> patch down, etc.).

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:20 UTC
Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning
Dario Faggioli writes ("Re: [PATCH RESEND 02/12] xl: allow for node-wise specification of vcpu pinning"):
> On gio, 2013-11-07 at 18:17 +0000, Ian Jackson wrote:
> > > -static int vcpupin_parse(char *cpu, libxl_bitmap *cpumap)
> > > +static int parse_range(const char *str, unsigned long *a, unsigned long *b)
> > > +{
> > > +    char *nstr, *endptr;
> >
> > Missing consts?
> >
> Is it? For instance, I'm using endptr like this:
>
>   *a = *b = strtoul(str, &endptr, 10);
>
> And the prototype of strtoul is:
>   unsigned long int strtoul(const char *nptr, char **endptr, int base);

Oh, yes, sorry, you're right about endptr.

> So it won't work, not for endptr at least. What I can do is the
> following:
>
>   const char *nstr;
>   char *endptr;
>
> And I will, if you think it's better.

Thanks, yes.

> > > +    for (ptr = strtok_r(cpu, ",", &saveptr); ptr;
> > > +         ptr = strtok_r(NULL, ",", &saveptr)) {
> >
> > A minor style complaint: If you are going to split two of these three
> > items onto their own line, please give them all their own line.
> >
> You mean ptr should have its own line? Like this?
>
>   + for (ptr = strtok_r(cpu, ",", &saveptr);
>   +      ptr;
>   +      ptr = strtok_r(NULL, ",", &saveptr)) {
>
> If yes, I sure can do that, although the result really looks super
> unpleasant to me (but that's a matter of taste, I guess).

IMO it makes it much easier to see which of these expressions is which of the three items in a for(;;). That's more important than making the lines compact.

Thanks, Ian.
Ian Jackson
2013-Nov-08 15:22 UTC
Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file
Dario Faggioli writes ("Re: [PATCH RESEND 12/12] xl: numa-sched: enable specifying node-affinity in VM config file"):
> On gio, 2013-11-07 at 18:35 +0000, Ian Jackson wrote:
> > I appreciate I may be a bit late with this complaint, but surely this
> > approach (setting the cpumap and nodemap when they aren't
> > all-bits-set) means that it's not possible to explicitly set "all
> > cpus".
> >
> The funny part is not that you're late, but to whom you're saying
> it! :-D :-D

Sorry about this. Forget about this complaint. I hadn't spotted the line

>  if (libxl_defbool_val(info->numa_placement)) {

above. So now I think it's correct.

Ian.
Konrad Rzeszutek Wilk
2013-Nov-12 16:01 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> by providing the proper get/set interfaces and wiring them
> to the new domctl-s from the previous commit.

s/previous commit/<title of the commit>/

>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> ---
>  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
>  tools/libxc/xenctrl.h   |   19 +++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> index f8825a3..e44ad2e 100644
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -272,6 +272,88 @@ out:
>      return ret;
>  }
>
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;

Ewww.. Could we just use regular -Exx for new xc_* calls?

> +    int nodesize;
> +
> +    nodesize = xc_get_cpumap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get  number of nodes");

Extra space.

> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for setvcpunodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_setvcpunodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vcpuaffinity.vcpu = vcpu;
> +
> +    memcpy(local, nodemap, nodesize);
> +
> +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> +
> +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
> +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if (local == NULL)
> +    {
> +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> +    domctl.domain = (domid_t)domid;
> +    domctl.u.vcpuaffinity.vcpu = vcpu;
> +
> +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;

Could the 8 be replaced by a sizeof?

> +
> +    ret = do_domctl(xch, &domctl);
> +
> +    memcpy(nodemap, local, nodesize);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +out:
> +    return ret;
> +}
> +
>  int xc_domain_get_guest_width(xc_interface *xch, uint32_t domid,
>                                unsigned int *guest_width)
>  {
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> index 8cf3f3b..208fa2c 100644
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
>                                 uint32_t domind,
>                                 xc_nodemap_t nodemap);
>
> +/**
> + * These functions set and retrieves the NUMA node-affinity
> + * of a specific vcpu.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id one is interested in.
> + * @parm vcpu the vcpu one wants to set/get the affinity of.
> + * @parm nodemap the map of the affine nodes.
> + * @return 0 on success, -1 on failure.

and something in errno?

> + */
> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap);
> +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            int vcpu,
> +                            xc_nodemap_t nodemap);
> +
>  int xc_vcpu_setaffinity(xc_interface *xch,
>                          uint32_t domid,
>                          int vcpu,
George Dunlap
2013-Nov-12 16:43 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On 11/12/2013 04:01 PM, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > by providing the proper get/set interfaces and wiring them
> > to the new domctl-s from the previous commit.
>
> s/previous commit/<title of the commit>/
>
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > ---
> >  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
> >  tools/libxc/xenctrl.h   |   19 +++++++++++
> >  2 files changed, 101 insertions(+)
> >
> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > index f8825a3..e44ad2e 100644
> > --- a/tools/libxc/xc_domain.c
> > +++ b/tools/libxc/xc_domain.c
> > @@ -272,6 +272,88 @@ out:
> >      return ret;
> >  }
> >
> > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
>
> Ewww.. Could we just use regular -Exx for new xc_* calls?

I think the standard (which Dario is just following here) is to return -1 and set errno. Since the failures are all after making calls to other xc functions, they should be setting errno properly, and we should be able to just leave it.

 -George
Konrad Rzeszutek Wilk
2013-Nov-12 16:55 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
George Dunlap <george.dunlap@eu.citrix.com> wrote:
>On 11/12/2013 04:01 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
>>> by providing the proper get/set interfaces and wiring them
>>> to the new domctl-s from the previous commit.
>>
>> s/previous commit/<title of the commit>/
>>
>>>
>>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>>> ---
>>>  tools/libxc/xc_domain.c |   82 +++++++++++++++++++++++++++++++++++++++++++++++
>>>  tools/libxc/xenctrl.h   |   19 +++++++++++
>>>  2 files changed, 101 insertions(+)
>>>
>>> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
>>> index f8825a3..e44ad2e 100644
>>> --- a/tools/libxc/xc_domain.c
>>> +++ b/tools/libxc/xc_domain.c
>>> @@ -272,6 +272,88 @@ out:
>>>      return ret;
>>>  }
>>>
>>> +int xc_vcpu_setnodeaffinity(xc_interface *xch,
>>> +                            uint32_t domid,
>>> +                            int vcpu,
>>> +                            xc_nodemap_t nodemap)
>>> +{
>>> +    DECLARE_DOMCTL;
>>> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
>>> +    int ret = -1;
>>
>> Ewww.. Could we just use regular -Exx for new xc_* calls?
>
>I think the standard (which Dario is just following here) is to return
>-1 and set errno. Since the failures are all after making calls to
>other xc functions, they should be setting errno properly, and we should
>be able to just leave it.
>
> -George

I recall some discussion with Ian Campbell where he expressed frustration
with the return -1. I thought we wanted to stamp these out going forward?
Dario Faggioli
2013-Nov-12 18:40 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 11:01 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > by providing the proper get/set interfaces and wiring them
> > to the new domctl-s from the previous commit.
>
> s/previous commit/<title of the commit>/
>
Ok.

> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
>
> Ewww.. Could we just use regular -Exx for new xc_* calls?
>
Mmm... As George said already, that is the most common behavior in
libxc... If we want something different, we should really write it down
somewhere (or has that happened already and I'm missing it?).

Also, yes, this does return -1 in cases where I don't have an error code
provided by the call that is actually failing, and yes, I totally rely
on it to set errno properly.

In the main operation implemented here, I just return the output of
do_domctl(), which, again, may be -1, may be some err-code, but it
really looks like it is what every other function there does, and, TBH,
it also seems the very most sane thing to do to me.

> > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > +                            uint32_t domid,
> > +                            int vcpu,
> > +                            xc_nodemap_t nodemap)
> > +{
> > +    DECLARE_DOMCTL;
> > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > +    int ret = -1;
> > +    int nodesize;
> > +
> > +    nodesize = xc_get_nodemap_size(xch);
> > +    if (!nodesize)
> > +    {
> > +        PERROR("Could not get number of nodes");
> > +        goto out;
> > +    }
> > +
> > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > +    if (local == NULL)
> > +    {
> > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > +        goto out;
> > +    }
> > +
> > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > +    domctl.domain = (domid_t)domid;
> > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > +
> > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
>
> Could the 8 be replaced by a sizeof?
>
I guess it could... What was it that you had in mind in particular?

Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
to feel comfortable with the 8... To the point that I think the
following would be even less readable:

  domctl.u.vcpuaffinity.map.nr_bits = nodesize *
                    sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);

But, of course, I'd do it that way if a maintainer asks for that.

> > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > index 8cf3f3b..208fa2c 100644
> > --- a/tools/libxc/xenctrl.h
> > +++ b/tools/libxc/xenctrl.h
> > @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
> >                                 uint32_t domind,
> >                                 xc_nodemap_t nodemap);
> >
> > +/**
> > + * These functions set and retrieves the NUMA node-affinity
> > + * of a specific vcpu.
> > + *
> > + * @parm xch a handle to an open hypervisor interface.
> > + * @parm domid the domain id one is interested in.
> > + * @parm vcpu the vcpu one wants to set/get the affinity of.
> > + * @parm nodemap the map of the affine nodes.
> > + * @return 0 on success, -1 on failure.
>
> and something in errno?
>
Right. I'll add a mention to that in the comment.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Konrad Rzeszutek Wilk
2013-Nov-12 19:13 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> On mar, 2013-11-12 at 11:01 -0500, Konrad Rzeszutek Wilk wrote:
> > On Tue, Nov 05, 2013 at 03:35:42PM +0100, Dario Faggioli wrote:
> > > by providing the proper get/set interfaces and wiring them
> > > to the new domctl-s from the previous commit.
> >
> > s/previous commit/<title of the commit>/
> >
> Ok.
>
> > > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> > > +int xc_vcpu_setnodeaffinity(xc_interface *xch,
> > > +                            uint32_t domid,
> > > +                            int vcpu,
> > > +                            xc_nodemap_t nodemap)
> > > +{
> > > +    DECLARE_DOMCTL;
> > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > +    int ret = -1;
> >
> > Ewww.. Could we just use regular -Exx for new xc_* calls?
> >
> Mmm... As George said already, that is the most common behavior in
> libxc... If we want something different, we should really write it down
> somewhere (or has that happened already and I'm missing it?).

I was under the impression that was it - but I am probably misremembering
Ian's comments and now sowing disinformation. So ignore my code comment
please.

>
> Also, yes, this does return -1 in cases where I don't have an error code
> provided by the call that is actually failing, and yes, I totally rely
> on it to set errno properly.
>
> In the main operation implemented here, I just return the output of
> do_domctl(), which, again, may be -1, may be some err-code, but it
> really looks like it is what every other function there does, and, TBH,
> it also seems the very most sane thing to do to me.
>
> > > > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > > > +                            uint32_t domid,
> > > > +                            int vcpu,
> > > > +                            xc_nodemap_t nodemap)
> > > > +{
> > > > +    DECLARE_DOMCTL;
> > > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > > +    int ret = -1;
> > > > +    int nodesize;
> > > > +
> > > > +    nodesize = xc_get_nodemap_size(xch);
> > > > +    if (!nodesize)
> > > > +    {
> > > > +        PERROR("Could not get number of nodes");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > > > +    if (local == NULL)
> > > > +    {
> > > > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > > > +    domctl.domain = (domid_t)domid;
> > > > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > > > +
> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);

Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned
long long)' or such. or some #define for this magic number.

>
> But, of course, I'd do it that way if a maintainer asks for that.
>
> > > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> > > index 8cf3f3b..208fa2c 100644
> > > --- a/tools/libxc/xenctrl.h
> > > +++ b/tools/libxc/xenctrl.h
> > > @@ -551,6 +551,25 @@ int xc_domain_node_getaffinity(xc_interface *xch,
> > >                                 uint32_t domind,
> > >                                 xc_nodemap_t nodemap);
> > >
> > > +/**
> > > + * These functions set and retrieves the NUMA node-affinity
> > > + * of a specific vcpu.
> > > + *
> > > + * @parm xch a handle to an open hypervisor interface.
> > > + * @parm domid the domain id one is interested in.
> > > + * @parm vcpu the vcpu one wants to set/get the affinity of.
> > > + * @parm nodemap the map of the affine nodes.
> > > + * @return 0 on success, -1 on failure.
> >
> > and something in errno?
> >
> Right. I'll add a mention to that in the comment.
>
> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://about.me/dario.faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
Dario Faggioli
2013-Nov-12 21:36 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 14:13 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> > Mmm... As George said already, that is the most common behavior in
> > libxc... If we want something different, we should really write it down
> > somewhere (or has that happened already and I'm missing it?).
>
> I was under the impression that was it - but I am probably misremembering
> Ian's comments and now sowing disinformation.
>
Well, it may well have been me having missed that, so, really, no big
deal at all. :-)

> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);
>
> Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned long long)'
> or such. or some #define for this magic number.
>
Maybe I can... I'll look into that (although no promises, ok? :-P). This
is all about playing games with uint8_t and array of them with weird
sizes... Again, I'll see if I can come up with something...

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Nov-13 10:57 UTC
Re: [PATCH RESEND 09/12] libxc: numa-sched: enable getting/specifying per-vcpu node-affinity
On mar, 2013-11-12 at 14:13 -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Nov 12, 2013 at 07:40:24PM +0100, Dario Faggioli wrote:
> > > > +int xc_vcpu_getnodeaffinity(xc_interface *xch,
> > > > +                            uint32_t domid,
> > > > +                            int vcpu,
> > > > +                            xc_nodemap_t nodemap)
> > > > +{
> > > > +    DECLARE_DOMCTL;
> > > > +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> > > > +    int ret = -1;
> > > > +    int nodesize;
> > > > +
> > > > +    nodesize = xc_get_nodemap_size(xch);
> > > > +    if (!nodesize)
> > > > +    {
> > > > +        PERROR("Could not get number of nodes");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> > > > +    if (local == NULL)
> > > > +    {
> > > > +        PERROR("Could not allocate memory for getvcpunodeaffinity domctl hypercall");
> > > > +        goto out;
> > > > +    }
> > > > +
> > > > +    domctl.cmd = XEN_DOMCTL_getvcpunodeaffinity;
> > > > +    domctl.domain = (domid_t)domid;
> > > > +    domctl.u.vcpuaffinity.vcpu = vcpu;
> > > > +
> > > > +    set_xen_guest_handle(domctl.u.vcpuaffinity.map.bitmap, local);
> > > > +    domctl.u.vcpuaffinity.map.nr_bits = nodesize * 8;
> > >
> > > Could the 8 be replaced by a sizeof?
> > >
> > I guess it could... What was it that you had in mind in particular?
> >
> > Personally, I consider the names 'bitmap' and 'nr_bits' talking enough
> > to feel comfortable with the 8... To the point that I think the
> > following would be even less readable:
> >
> >   domctl.u.vcpuaffinity.map.nr_bits = nodesize *
> >                     sizeof(domctl.u.vcpuaffinity.cpumap.bitmap);
>
> Eww. That is worst. Somehow I assumed you could just do 'sizeof(unsigned long long)'
> or such. or some #define for this magic number.
>
I did think more about this and (re)did the math. After that, I don't
think there is much we can do to make it more clear than it already is
with the 8, assuming that we all are cool with the fact that there are
8 bits in a byte, because that's what the 8 means there.

In fact, both the passed xc_nodemap_t (nodemap, in this case) and the
hypercall buffer (local) are allocated to be nodesize (i.e., the result
of xc_get_nodemap_size()) _bytes_. Since what we want to store in
nr_bits is the number of actual bits available in the resulting bitmap,
given how large it is in bytes, there is nothing much else I can think
of to get there than multiplying that number of bytes by the number of
bits in a byte.

Therefore, either we find a suitable sizeof() for "nr_of_bits_in_a_byte",
or I "#define BITS_PER_BYTE 8" (:-P), or we stick with the 8, which,
again, I think is the best option.

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)