Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
Hi Everyone,

Here comes a patch series instilling some NUMA awareness into the Credit scheduler. These patches teach Xen's scheduler how to try to maximize performance on a NUMA host, taking advantage of the information coming from the automatic NUMA placement we have in libxl.

Right now, the placement algorithm runs and selects a node (or a set of nodes) where it is best to put a new domain. Then, all the memory for the new domain is allocated from those node(s) and all the vCPUs of the new domain are pinned to the pCPUs of those node(s). What we do here is, instead of statically pinning the domain's vCPUs to the nodes' pCPUs, have the (Credit) scheduler _prefer_ running them there. That enables most of the performance benefit of "real" pinning, but without its intrinsic lack of flexibility.

The above happens by extending to the scheduler the knowledge of a domain's node-affinity. We then ask it to first try to run the domain's vCPUs on one of the nodes the domain has affinity with. Of course, if that turns out to be impossible, it falls back to the old behaviour (i.e., considering vcpu-affinity only).

Allow me to mention that NUMA-aware scheduling is not only one of the items of the NUMA roadmap I'm trying to maintain at http://wiki.xen.org/wiki/Xen_NUMA_Roadmap. It is also one of the features we decided we want for Xen 4.3 (and thus it is part of the list of such features that George is maintaining).

Up to now, I have been able to thoroughly test this only on my 2-NUMA-node testbox, by running the SpecJBB2005 benchmark concurrently in multiple VMs, and the results look really nice.
A full set of what I got can be found in my presentation from the last XenSummit, which is available here:

  http://www.slideshare.net/xen_com_mgr/numa-and-virtualization-the-case-of-xen?ref=http://www.xen.org/xensummit/xs12na_talks/T9.html

However, I re-ran some of the tests in these last days (since I changed some bits of the implementation), and here is what I got:

  -------------------------------------------------------
   SpecJBB2005 Total Aggregate Throughput
  -------------------------------------------------------
   #VMs   No NUMA affinity   NUMA affinity &    +/- %
                               scheduling
  -------------------------------------------------------
     2       34653.273         40243.015       +16.13%
     4       29883.057         35526.807       +18.88%
     6       23512.926         27015.786       +14.89%
     8       19120.243         21825.818       +14.15%
    10       15676.675         17701.472       +12.91%
  -------------------------------------------------------

Basically, the results are consistent with what is shown in the super-nice graphs I have in the slides above! :-)

As said, this looks nice to me, especially considering that my test machine is quite small, i.e., its 2 nodes are very close to each other from a latency point of view. I really expect more improvement on bigger hardware, where a much greater NUMA effect is to be expected. Of course, I myself will continue benchmarking (hopefully on systems with more than 2 nodes too), but should anyone want to run their own tests, that would be great, so feel free to do so and report the results to me and/or to the list!

A little bit more about the series:

  1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
  2/8 xen, libxc: introduce node maps and masks

Do some preparation work.

  3/8 xen: let the (credit) scheduler know about `node affinity`

Is where the vcpu load balancing logic of the credit scheduler is modified to support node-affinity.
  4/8 xen: allow for explicitly specifying node-affinity
  5/8 libxc: allow for explicitly specifying node-affinity
  6/8 libxl: allow for explicitly specifying node-affinity
  7/8 libxl: automatic placement deals with node-affinity

Are what wires the in-scheduler node-affinity support to the external world. Please note that patch 4 touches XSM and Flask, which is the area with which I have the least experience and the least chance to test properly. So, if Daniel and/or anyone interested in that could take a look and comment, that would be awesome.

  8/8 xl: report node-affinity for domains

Is just a small output enhancement.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 1 of 8] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
More specifically, this patch:

 1. replaces xenctl_cpumap with xenctl_bitmap;
 2. provides bitmap_to_xenctl_bitmap and the reverse;
 3. re-implements cpumask_to_xenctl_bitmap with bitmap_to_xenctl_bitmap, and the same for the reverse.

Other than #3, there are no functional changes, and the interface is only slightly affected. This is in preparation for introducing NUMA node-affinity maps.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
--- a/tools/libxc/xc_cpupool.c
+++ b/tools/libxc/xc_cpupool.c
@@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
     sysctl.u.cpupool_op.cpupool_id = poolid;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
@@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
     sysctl.cmd = XEN_SYSCTL_cpupool_op;
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
     domctl.u.vcpuaffinity.vcpu = vcpu;
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
--- a/tools/libxc/xc_tbuf.c +++ b/tools/libxc/xc_tbuf.c @@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8); set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap); - sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8; + sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8; ret = do_sysctl(xch, &sysctl); diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c --- a/xen/arch/x86/cpu/mcheck/mce.c +++ b/xen/arch/x86/cpu/mcheck/mce.c @@ -1474,8 +1474,7 @@ long do_mca(XEN_GUEST_HANDLE(xen_mc_t) u cpumap = &cpu_online_map; else { - ret = xenctl_cpumap_to_cpumask(&cmv, - &op->u.mc_inject_v2.cpumap); + ret = xenctl_bitmap_to_cpumask(&cmv, &op->u.mc_inject_v2.cpumap); if ( ret ) break; cpumap = cmv; diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c --- a/xen/arch/x86/platform_hypercall.c +++ b/xen/arch/x86/platform_hypercall.c @@ -371,7 +371,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe { uint32_t cpu; uint64_t idletime, now = NOW(); - struct xenctl_cpumap ctlmap; + struct xenctl_bitmap ctlmap; cpumask_var_t cpumap; XEN_GUEST_HANDLE(uint8) cpumap_bitmap; XEN_GUEST_HANDLE(uint64) idletimes; @@ -384,11 +384,11 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe if ( cpufreq_controller != FREQCTL_dom0_kernel ) break; - ctlmap.nr_cpus = op->u.getidletime.cpumap_nr_cpus; + ctlmap.nr_elems = op->u.getidletime.cpumap_nr_cpus; guest_from_compat_handle(cpumap_bitmap, op->u.getidletime.cpumap_bitmap); ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion */ - if ( (ret = xenctl_cpumap_to_cpumask(&cpumap, &ctlmap)) != 0 ) + if ( (ret = xenctl_bitmap_to_cpumask(&cpumap, &ctlmap)) != 0 ) goto out; guest_from_compat_handle(idletimes, op->u.getidletime.idletime); @@ -407,7 +407,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe op->u.getidletime.now = now; if ( ret == 0 ) - ret = cpumask_to_xenctl_cpumap(&ctlmap, cpumap); + ret = 
cpumask_to_xenctl_bitmap(&ctlmap, cpumap); free_cpumask_var(cpumap); if ( ret == 0 && copy_to_guest(u_xenpf_op, op, 1) ) diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c --- a/xen/common/cpupool.c +++ b/xen/common/cpupool.c @@ -493,7 +493,7 @@ int cpupool_do_sysctl(struct xen_sysctl_ op->cpupool_id = c->cpupool_id; op->sched_id = c->sched->sched_id; op->n_dom = c->n_dom; - ret = cpumask_to_xenctl_cpumap(&op->cpumap, c->cpu_valid); + ret = cpumask_to_xenctl_bitmap(&op->cpumap, c->cpu_valid); cpupool_put(c); } break; @@ -588,7 +588,7 @@ int cpupool_do_sysctl(struct xen_sysctl_ case XEN_SYSCTL_CPUPOOL_OP_FREEINFO: { - ret = cpumask_to_xenctl_cpumap( + ret = cpumask_to_xenctl_bitmap( &op->cpumap, &cpupool_free_cpus); } break; diff --git a/xen/common/domctl.c b/xen/common/domctl.c --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -32,28 +32,29 @@ static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); -int cpumask_to_xenctl_cpumap( - struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask) +int bitmap_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_bitmap, + const unsigned long *bitmap, + unsigned int nbits) { unsigned int guest_bytes, copy_bytes, i; uint8_t zero = 0; int err = 0; - uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8); + uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8); if ( !bytemap ) return -ENOMEM; - guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8; - copy_bytes = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8); + guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8; + copy_bytes = min_t(unsigned int, guest_bytes, (nbits + 7) / 8); - bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids); + bitmap_long_to_byte(bytemap, bitmap, nbits); if ( copy_bytes != 0 ) - if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) ) + if ( copy_to_guest(xenctl_bitmap->bitmap, bytemap, copy_bytes) ) err = -EFAULT; for ( i = copy_bytes; !err && i < guest_bytes; i++ ) - if ( 
copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) ) + if ( copy_to_guest_offset(xenctl_bitmap->bitmap, i, &zero, 1) ) err = -EFAULT; xfree(bytemap); @@ -61,36 +62,59 @@ int cpumask_to_xenctl_cpumap( return err; } -int xenctl_cpumap_to_cpumask( - cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap) +int xenctl_bitmap_to_bitmap(unsigned long *bitmap, + const struct xenctl_bitmap *xenctl_bitmap, + unsigned int nbits) { unsigned int guest_bytes, copy_bytes; int err = 0; - uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8); + uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8); if ( !bytemap ) return -ENOMEM; - guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8; - copy_bytes = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8); + guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8; + copy_bytes = min_t(unsigned int, guest_bytes, (nbits + 7) / 8); if ( copy_bytes != 0 ) { - if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) ) + if ( copy_from_guest(bytemap, xenctl_bitmap->bitmap, copy_bytes) ) err = -EFAULT; - if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes <= sizeof(bytemap)) ) - bytemap[guest_bytes-1] &= ~(0xff << (xenctl_cpumap->nr_cpus & 7)); + if ( (xenctl_bitmap->nr_elems & 7) && + (guest_bytes <= sizeof(bytemap)) ) + bytemap[guest_bytes-1] &= ~(0xff << (xenctl_bitmap->nr_elems & 7)); } - if ( err ) - /* nothing */; - else if ( alloc_cpumask_var(cpumask) ) - bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids); + if ( !err ) + bitmap_byte_to_long(bitmap, bytemap, nbits); + + xfree(bytemap); + + return err; +} + +int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_cpumap, + const cpumask_t *cpumask) +{ + return bitmap_to_xenctl_bitmap(xenctl_cpumap, cpumask_bits(cpumask), + nr_cpu_ids); +} + +int xenctl_bitmap_to_cpumask(cpumask_var_t *cpumask, + const struct xenctl_bitmap *xenctl_cpumap) +{ + int err = 0; + + if ( alloc_cpumask_var(cpumask) ) { + err = 
xenctl_bitmap_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap, + nr_cpu_ids); + /* In case of error, cleanup is up to us, as the caller won''t care! */ + if ( err ) + free_cpumask_var(*cpumask); + } else err = -ENOMEM; - xfree(bytemap); - return err; } @@ -621,7 +645,7 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc { cpumask_var_t new_affinity; - ret = xenctl_cpumap_to_cpumask( + ret = xenctl_bitmap_to_cpumask( &new_affinity, &op->u.vcpuaffinity.cpumap); if ( !ret ) { @@ -631,7 +655,7 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc } else { - ret = cpumask_to_xenctl_cpumap( + ret = cpumask_to_xenctl_bitmap( &op->u.vcpuaffinity.cpumap, v->cpu_affinity); } diff --git a/xen/common/trace.c b/xen/common/trace.c --- a/xen/common/trace.c +++ b/xen/common/trace.c @@ -384,7 +384,7 @@ int tb_control(xen_sysctl_tbuf_op_t *tbc { cpumask_var_t mask; - rc = xenctl_cpumap_to_cpumask(&mask, &tbc->cpu_mask); + rc = xenctl_bitmap_to_cpumask(&mask, &tbc->cpu_mask); if ( !rc ) { cpumask_copy(&tb_cpu_mask, mask); diff --git a/xen/include/public/arch-x86/xen-mca.h b/xen/include/public/arch-x86/xen-mca.h --- a/xen/include/public/arch-x86/xen-mca.h +++ b/xen/include/public/arch-x86/xen-mca.h @@ -414,7 +414,7 @@ struct xen_mc_mceinject { struct xen_mc_inject_v2 { uint32_t flags; - struct xenctl_cpumap cpumap; + struct xenctl_bitmap cpumap; }; #endif diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -284,7 +284,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc /* XEN_DOMCTL_getvcpuaffinity */ struct xen_domctl_vcpuaffinity { uint32_t vcpu; /* IN */ - struct xenctl_cpumap cpumap; /* IN/OUT */ + struct xenctl_bitmap cpumap; /* IN/OUT */ }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h --- a/xen/include/public/sysctl.h +++ b/xen/include/public/sysctl.h @@ -71,7 +71,7 @@ 
struct xen_sysctl_tbuf_op { #define XEN_SYSCTL_TBUFOP_disable 5 uint32_t cmd; /* IN/OUT variables */ - struct xenctl_cpumap cpu_mask; + struct xenctl_bitmap cpu_mask; uint32_t evt_mask; /* OUT variables */ uint64_aligned_t buffer_mfn; @@ -532,7 +532,7 @@ struct xen_sysctl_cpupool_op { uint32_t domid; /* IN: M */ uint32_t cpu; /* IN: AR */ uint32_t n_dom; /* OUT: I */ - struct xenctl_cpumap cpumap; /* OUT: IF */ + struct xenctl_bitmap cpumap; /* OUT: IF */ }; typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t; DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t); diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h --- a/xen/include/public/xen.h +++ b/xen/include/public/xen.h @@ -820,9 +820,9 @@ typedef uint8_t xen_domain_handle_t[16]; #endif #ifndef __ASSEMBLY__ -struct xenctl_cpumap { +struct xenctl_bitmap { XEN_GUEST_HANDLE_64(uint8) bitmap; - uint32_t nr_cpus; + uint32_t nr_elems; }; #endif diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h --- a/xen/include/xen/cpumask.h +++ b/xen/include/xen/cpumask.h @@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map; #define for_each_present_cpu(cpu) for_each_cpu(cpu, &cpu_present_map) /* Copy to/from cpumap provided by control tools. */ -struct xenctl_cpumap; -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *); -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *); +struct xenctl_bitmap; +int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *); +int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *); #endif /* __XEN_CPUMASK_H */ diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst --- a/xen/include/xlat.lst +++ b/xen/include/xlat.lst @@ -2,7 +2,7 @@ # ! - needs translation # ? - needs checking ? dom0_vga_console_info xen.h -? xenctl_cpumap xen.h +? xenctl_bitmap xen.h ? mmu_update xen.h ! mmuext_op xen.h ! start_info xen.h
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 2 of 8] xen, libxc: introduce node maps and masks
Following suit from the cpumap and cpumask implementations.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch
     return (xc_get_max_cpus(xch) + 7) / 8;
 }
 
+int xc_get_nodemap_size(xc_interface *xch)
+{
+    return (xc_get_max_nodes(xch) + 7) / 8;
+}
+
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
 {
     int sz;
@@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface
     return calloc(1, sz);
 }
 
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
+{
+    int sz;
+
+    sz = xc_get_nodemap_size(xch);
+    if (sz == 0)
+        return NULL;
+    return calloc(1, sz);
+}
+
 int xc_readconsolering(xc_interface *xch,
                        char *buffer,
                        unsigned int *pnr_chars,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
 /* allocate a cpumap */
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
- /*
+/*
  * NODEMAP handling
  */
+typedef uint8_t *xc_nodemap_t;
+
 /* return maximum number of NUMA nodes the hypervisor supports */
 int xc_get_max_nodes(xc_interface *xch);
 
+/* return array size for nodemap */
+int xc_get_nodemap_size(xc_interface *xch);
+
+/* allocate a nodemap */
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
+
 /*
  * DOMAIN DEBUGGING FUNCTIONS
  */
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -118,6 +118,30 @@ int xenctl_bitmap_to_cpumask(cpumask_var
     return err;
 }
 
+int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap,
+                              const nodemask_t *nodemask)
+{
+    return bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask),
+                                   MAX_NUMNODES);
+}
+
+int xenctl_bitmap_to_nodemask(nodemask_t *nodemask,
+                              const struct xenctl_bitmap *xenctl_nodemap)
+{
+    int err = 0;
+
+    if ( alloc_nodemask_var(nodemask) ) {
+        err = xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap,
+                                      MAX_NUMNODES);
+        if ( err )
+            free_nodemask_var(*nodemask);
+    }
+    else
+        err = -ENOMEM;
+
+    return err;
+}
+
 static inline int is_free_domid(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -298,6 +298,53 @@ static inline int __nodemask_parse(const
 }
 #endif
 
+/*
+ * nodemask_var_t: struct nodemask for stack usage.
+ *
+ * See definition of cpumask_var_t in include/xen/cpumask.h.
+ */
+#if MAX_NUMNODES > 2 * BITS_PER_LONG
+#include <xen/xmalloc.h>
+
+typedef nodemask_t *nodemask_var_t;
+
+#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG)
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+    *(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long));
+    return *mask != NULL;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+    *(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long));
+    return *mask != NULL;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+    xfree(mask);
+}
+#else
+typedef nodemask_t nodemask_var_t;
+
+static inline bool_t alloc_nodemask_var(nodemask_var_t *mask)
+{
+    return 1;
+}
+
+static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask)
+{
+    nodes_clear(*mask);
+    return 1;
+}
+
+static inline void free_nodemask_var(nodemask_var_t mask)
+{
+}
+#endif
+
 #if MAX_NUMNODES > 1
 #define for_each_node_mask(node, mask) \
     for ((node) = first_node(mask); \
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
As vcpu-affinity tells where vcpus can run, node-affinity tells where a domain's vcpus prefer to run. Respecting vcpu-affinity is the primary concern, but honouring node-affinity will likely result in some performance benefit.

This change modifies the vcpu load balancing algorithm (for the credit scheduler only), introducing a two-step logic. During the first step, we use the node-affinity mask. The aim is to give precedence to the CPUs where it is known to be preferable for the domain to run. If that fails to find a valid CPU, node-affinity is just ignored and, in the second step, we fall back to using cpu-affinity only.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -101,6 +101,13 @@
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_CPU_AFFINITY 0
+#define CSCHED_BALANCE_NODE_AFFINITY 1
+#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY
+
+/*
  * Boot parameters
  */
 static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS;
@@ -148,6 +155,9 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /* cpumask translated from the domain's node-affinity
+     * mask. Basically, the CPUs we prefer to be scheduled on. */
+    cpumask_var_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -228,6 +238,39 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define for_each_csched_balance_step(__step) \
+    for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )
+
+/*
+ * Each csched-balance step should use its own cpumask. This function
+ * determines which one, given the step, and copies it in mask.
Notice + * that, in the case of a node balancing step, it also filters out from + * the node-affinity mask the cpus that are not part of vc''s cpu-affinity, + * as we do not want to end up running a vcpu where it is not allowed to! + * + * As an optimization, if a domain does not have any specific node-affinity + * (namely, its node affinity is automatically computed), we inform the + * caller that he can skip the first step by returning -1. + */ +static int +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) +{ + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) + { + struct domain *d = vc->domain; + struct csched_dom *sdom = CSCHED_DOM(d); + + if ( cpumask_full(sdom->node_affinity_cpumask) ) + return -1; + + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); + } + else /* step == CSCHED_BALANACE_CPU_AFFINITY */ + cpumask_copy(mask, vc->cpu_affinity); + + return 0; +} + static void burn_credits(struct csched_vcpu *svc, s_time_t now) { s_time_t delta; @@ -250,6 +293,20 @@ boolean_param("tickle_one_idle_cpu", opt DEFINE_PER_CPU(unsigned int, last_tickle_cpu); static inline void +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask) +{ + CSCHED_STAT_CRANK(tickle_idlers_some); + if ( opt_tickle_one_idle ) + { + this_cpu(last_tickle_cpu) + cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask); + cpumask_set_cpu(this_cpu(last_tickle_cpu), mask); + } + else + cpumask_or(mask, mask, idle_mask); +} + +static inline void __runq_tickle(unsigned int cpu, struct csched_vcpu *new) { struct csched_vcpu * const cur @@ -287,22 +344,26 @@ static inline void } else { - cpumask_t idle_mask; + cpumask_t idle_mask, balance_mask; + int balance_step; - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); - if ( !cpumask_empty(&idle_mask) ) + for_each_csched_balance_step(balance_step) { - CSCHED_STAT_CRANK(tickle_idlers_some); - if ( opt_tickle_one_idle ) - { - this_cpu(last_tickle_cpu) = - cpumask_cycle(this_cpu(last_tickle_cpu), 
&idle_mask); - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); - } - else - cpumask_or(&mask, &mask, &idle_mask); + if ( csched_balance_cpumask(new->vcpu, balance_step, + &balance_mask) ) + continue; + + /* Look for idlers in the step''s cpumask */ + cpumask_and(&idle_mask, prv->idlers, &balance_mask); + if ( !cpumask_empty(&idle_mask) ) + __cpumask_tickle(&mask, &idle_mask); + + cpumask_and(&mask, &mask, &balance_mask); + + /* We can quit balancing if we found someone to tickle */ + if ( !cpumask_empty(&mask) ) + break; } - cpumask_and(&mask, &mask, new->vcpu->cpu_affinity); } } @@ -443,35 +504,42 @@ static inline int } static inline int -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu) +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) { /* * Don''t pick up work that''s in the peer''s scheduling tail or hot on - * peer PCPU. Only pick up work that''s allowed to run on our CPU. + * peer PCPU. Only pick up work that prefers and/or is allowed to run + * on our CPU. */ return !vc->is_running && !__csched_vcpu_is_cache_hot(vc) && - cpumask_test_cpu(dest_cpu, vc->cpu_affinity); + cpumask_test_cpu(dest_cpu, mask); } static int _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit) { - cpumask_t cpus; + cpumask_t cpus, start_cpus; cpumask_t idlers; cpumask_t *online; + struct csched_dom *sdom = CSCHED_DOM(vc->domain); struct csched_pcpu *spc = NULL; int cpu; /* - * Pick from online CPUs in VCPU''s affinity mask, giving a - * preference to its current processor if it''s in there. + * Pick an online CPU from the && of vcpu-affinity and node-affinity + * masks (if not empty, in which case only the vcpu-affinity mask is + * used). Also, try to give a preference to its current processor if + * it''s in there. 
*/ online = cpupool_scheduler_cpumask(vc->domain->cpupool); cpumask_and(&cpus, online, vc->cpu_affinity); - cpu = cpumask_test_cpu(vc->processor, &cpus) + cpumask_and(&start_cpus, &cpus, sdom->node_affinity_cpumask); + if ( unlikely(cpumask_empty(&start_cpus)) ) + cpumask_copy(&start_cpus, &cpus); + cpu = cpumask_test_cpu(vc->processor, &start_cpus) ? vc->processor - : cpumask_cycle(vc->processor, &cpus); + : cpumask_cycle(vc->processor, &start_cpus); ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); /* @@ -867,6 +935,13 @@ csched_alloc_domdata(const struct schedu if ( sdom == NULL ) return NULL; + if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) + { + xfree(sdom); + return NULL; + } + cpumask_setall(sdom->node_affinity_cpumask); + /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); sdom->active_vcpu_count = 0; @@ -900,6 +975,9 @@ csched_dom_init(const struct scheduler * static void csched_free_domdata(const struct scheduler *ops, void *data) { + struct csched_dom *sdom = data; + + free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1211,30 +1289,48 @@ csched_runq_steal(int peer_cpu, int cpu, */ if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) ) { - list_for_each( iter, &peer_pcpu->runq ) + int balance_step; + + /* + * Take node-affinity into account. That means, for all the vcpus + * in peer_pcpu''s runq, check _first_ if their node-affinity allows + * them to run on cpu. If not, retry the loop considering plain + * vcpu-affinity. Also, notice that as soon as one vcpu is found, + * balancing is considered done, and the vcpu is returned to the + * caller. + */ + for_each_csched_balance_step(balance_step) { - speer = __runq_elem(iter); + list_for_each( iter, &peer_pcpu->runq ) + { + cpumask_t balance_mask; - /* - * If next available VCPU here is not of strictly higher - * priority than ours, this PCPU is useless to us. 
- */ - if ( speer->pri <= pri ) - break; + speer = __runq_elem(iter); - /* Is this VCPU is runnable on our PCPU? */ - vc = speer->vcpu; - BUG_ON( is_idle_vcpu(vc) ); + /* + * If next available VCPU here is not of strictly higher + * priority than ours, this PCPU is useless to us. + */ + if ( speer->pri <= pri ) + break; - if (__csched_vcpu_is_migrateable(vc, cpu)) - { - /* We got a candidate. Grab it! */ - CSCHED_VCPU_STAT_CRANK(speer, migrate_q); - CSCHED_STAT_CRANK(migrate_queued); - WARN_ON(vc->is_urgent); - __runq_remove(speer); - vc->processor = cpu; - return speer; + /* Is this VCPU runnable on our PCPU? */ + vc = speer->vcpu; + BUG_ON( is_idle_vcpu(vc) ); + + if ( csched_balance_cpumask(vc, balance_step, &balance_mask) ) + continue; + + if (__csched_vcpu_is_migrateable(vc, cpu, &balance_mask)) + { + /* We got a candidate. Grab it! */ + CSCHED_VCPU_STAT_CRANK(speer, migrate_q); + CSCHED_STAT_CRANK(migrate_queued); + WARN_ON(vc->is_urgent); + __runq_remove(speer); + vc->processor = cpu; + return speer; + } } } }
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 4 of 8] xen: allow for explicitly specifying node-affinity
Make it possible to pass the node-affinity of a domain to the hypervisor from the upper layers, instead of having it always computed automatically.

Note that this also required generalizing the Flask hooks for setting and getting the affinity, so that they now deal with both vcpu and node affinity.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->auto_node_affinity = 1;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -362,11 +363,26 @@ void domain_update_node_affinity(struct
         cpumask_or(cpumask, cpumask, online_affinity);
     }
 
-    for_each_online_node ( node )
-        if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
-            node_set(node, nodemask);
+    if ( d->auto_node_affinity )
+    {
+        /* Node-affinity is automatically computed from all vcpu-affinities */
+        for_each_online_node ( node )
+            if ( cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_set(node, nodemask);
 
-    d->node_affinity = nodemask;
+        d->node_affinity = nodemask;
+    }
+    else
+    {
+        /* Node-affinity is provided by someone else, just filter out cpus
+         * that are either offline or not in the affinity of any vcpus. */
+        for_each_node_mask ( node, d->node_affinity )
+            if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) )
+                node_clear(node, d->node_affinity);
+    }
+
+    sched_set_node_affinity(d, &d->node_affinity);
+
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -374,6 +390,36 @@ void domain_update_node_affinity(struct
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    /* Being affine with no nodes is just wrong */
+    if ( nodes_empty(*affinity) )
+        return -EINVAL;
+
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming explicitly affine to all nodes is not particularly
+     * useful.
Let''s take it as the `reset node affinity` command. + */ + if ( nodes_full(*affinity) ) + { + d->auto_node_affinity = 1; + goto out; + } + + d->auto_node_affinity = 0; + d->node_affinity = *affinity; + +out: + spin_unlock(&d->node_affinity_lock); + + domain_update_node_affinity(d); + + return 0; +} + + struct domain *get_domain_by_id(domid_t dom) { struct domain *d; diff --git a/xen/common/domctl.c b/xen/common/domctl.c --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -642,6 +642,40 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc } break; + case XEN_DOMCTL_setnodeaffinity: + case XEN_DOMCTL_getnodeaffinity: + { + domid_t dom = op->domain; + struct domain *d = rcu_lock_domain_by_id(dom); + + ret = -ESRCH; + if ( d == NULL ) + break; + + ret = xsm_nodeaffinity(op->cmd, d); + if ( ret ) + goto nodeaffinity_out; + + if ( op->cmd == XEN_DOMCTL_setnodeaffinity ) + { + nodemask_t new_affinity; + + ret = xenctl_bitmap_to_nodemask(&new_affinity, + &op->u.nodeaffinity.nodemap); + if ( !ret ) + ret = domain_set_node_affinity(d, &new_affinity); + } + else + { + ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap, + &d->node_affinity); + } + + nodeaffinity_out: + rcu_unlock_domain(d); + } + break; + case XEN_DOMCTL_setvcpuaffinity: case XEN_DOMCTL_getvcpuaffinity: { diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int *set++ = ''\0''; } +static void nodeset_print(char *set, int size, const nodemask_t *mask) +{ + *set++ = ''[''; + set += nodelist_scnprintf(set, size-2, mask); + *set++ = '']''; + *set++ = ''\0''; +} + static void periodic_timer_print(char *str, int size, uint64_t period) { if ( period == 0 ) @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k dump_pageframe_info(d); + nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity); + printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr); + printk("VCPU 
information and callbacks for domain %u:\n", d->domain_id); for_each_vcpu ( d, v ) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -238,6 +238,33 @@ static inline void list_del_init(&svc->runq_elem); } +/* + * Translates node-affinity mask into a cpumask, so that we can use it during + * actual scheduling. That of course will contain all the cpus from all the + * set nodes in the original node-affinity mask. + * + * Note that any serialization needed to access mask safely is complete + * responsibility of the caller of this function/hook. + */ +static void csched_set_node_affinity( + const struct scheduler *ops, + struct domain *d, + nodemask_t *mask) +{ + struct csched_dom *sdom; + int node; + + /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ + if ( unlikely(is_idle_domain(d)) ) + return; + + sdom = CSCHED_DOM(d); + cpumask_clear(sdom->node_affinity_cpumask); + for_each_node_mask( node, *mask ) + cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, + &node_to_cpumask(node)); +} + #define for_each_csched_balance_step(__step) \ for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) @@ -260,7 +287,8 @@ csched_balance_cpumask(const struct vcpu struct domain *d = vc->domain; struct csched_dom *sdom = CSCHED_DOM(d); - if ( cpumask_full(sdom->node_affinity_cpumask) ) + if ( cpumask_full(sdom->node_affinity_cpumask) || + d->auto_node_affinity == 1 ) return -1; cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); @@ -1786,6 +1814,8 @@ const struct scheduler sched_credit_def .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, + .set_node_affinity = csched_set_node_affinity, + .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -588,6 +588,11 @@ int cpu_disable_scheduler(unsigned int c 
return ret; } +void sched_set_node_affinity(struct domain *d, nodemask_t *mask) +{ + SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); +} + int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) { cpumask_t online_affinity; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t); +/* Get/set the NUMA node(s) with which the guest has affinity with. */ +/* XEN_DOMCTL_setnodeaffinity */ +/* XEN_DOMCTL_getnodeaffinity */ +struct xen_domctl_nodeaffinity { + struct xenctl_bitmap nodemap;/* IN */ +}; +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); + + /* Get/set which physical cpus a vcpu can execute on. */ /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ @@ -900,6 +910,8 @@ struct xen_domctl { #define XEN_DOMCTL_set_access_required 64 #define XEN_DOMCTL_audit_p2m 65 #define XEN_DOMCTL_set_virq_handler 66 +#define XEN_DOMCTL_setnodeaffinity 67 +#define XEN_DOMCTL_getnodeaffinity 68 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -913,6 +925,7 @@ struct xen_domctl { struct xen_domctl_getpageframeinfo getpageframeinfo; struct xen_domctl_getpageframeinfo2 getpageframeinfo2; struct xen_domctl_getpageframeinfo3 getpageframeinfo3; + struct xen_domctl_nodeaffinity nodeaffinity; struct xen_domctl_vcpuaffinity vcpuaffinity; struct xen_domctl_shadow_op shadow_op; struct xen_domctl_max_mem max_mem; diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h --- a/xen/include/xen/nodemask.h +++ b/xen/include/xen/nodemask.h @@ -8,8 +8,9 @@ * See detailed comments in the file linux/bitmap.h describing the * data type on which these nodemasks are based. 
* - * For details of nodemask_scnprintf() and nodemask_parse(), - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. + * For details of nodemask_scnprintf(), nodelist_scnpintf() and + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse() + * in lib/bitmap.c. * * The available nodemask operations are: * @@ -48,6 +49,7 @@ * unsigned long *nodes_addr(mask) Array of unsigned long''s in mask * * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask * * for_each_node_mask(node, mask) for-loop node over mask @@ -280,6 +282,14 @@ static inline int __first_unset_node(con #define nodes_addr(src) ((src).bits) +#define nodelist_scnprintf(buf, len, src) \ + __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES) +static inline int __nodelist_scnprintf(char *buf, int len, + const nodemask_t *srcp, int nbits) +{ + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); +} + #if 0 #define nodemask_scnprintf(buf, len, src) \ __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES) diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -182,6 +182,8 @@ struct scheduler { struct xen_domctl_scheduler_op *); int (*adjust_global) (const struct scheduler *, struct xen_sysctl_scheduler_op *); + void (*set_node_affinity) (const struct scheduler *, + struct domain *, nodemask_t *); void (*dump_settings) (const struct scheduler *); void (*dump_cpu_state) (const struct scheduler *, int); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -346,8 +346,12 @@ struct domain /* Various mem_events */ struct mem_event_per_domain *mem_event; - /* Currently computed from union of all vcpu cpu-affinity masks. */ + /* + * Can be specified by the user. 
If that is not the case, it is + * computed from the union of all the vcpu cpu-affinity masks. + */ nodemask_t node_affinity; + int auto_node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; }; @@ -416,6 +420,7 @@ static inline void get_knownalive_domain ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED)); } +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity); void domain_update_node_affinity(struct domain *d); struct domain *domain_create( @@ -519,6 +524,7 @@ void sched_destroy_domain(struct domain int sched_move_domain(struct domain *d, struct cpupool *c); long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); long sched_adjust_global(struct xen_sysctl_scheduler_op *); +void sched_set_node_affinity(struct domain *, nodemask_t *); int sched_id(void); void sched_tick_suspend(void); void sched_tick_resume(void); diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h --- a/xen/include/xsm/xsm.h +++ b/xen/include/xsm/xsm.h @@ -56,6 +56,7 @@ struct xsm_operations { int (*domain_create) (struct domain *d, u32 ssidref); int (*max_vcpus) (struct domain *d); int (*destroydomain) (struct domain *d); + int (*nodeaffinity) (int cmd, struct domain *d); int (*vcpuaffinity) (int cmd, struct domain *d); int (*scheduler) (struct domain *d); int (*getdomaininfo) (struct domain *d); @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str return xsm_call(destroydomain(d)); } +static inline int xsm_nodeaffinity (int cmd, struct domain *d) +{ + return xsm_call(nodeaffinity(cmd, d)); +} + static inline int xsm_vcpuaffinity (int cmd, struct domain *d) { return xsm_call(vcpuaffinity(cmd, d)); diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c --- a/xen/xsm/dummy.c +++ b/xen/xsm/dummy.c @@ -634,6 +634,7 @@ void xsm_fixup_ops (struct xsm_operation set_to_dummy_if_null(ops, domain_create); set_to_dummy_if_null(ops, max_vcpus); set_to_dummy_if_null(ops, destroydomain); + set_to_dummy_if_null(ops, nodeaffinity); 
set_to_dummy_if_null(ops, vcpuaffinity); set_to_dummy_if_null(ops, scheduler); set_to_dummy_if_null(ops, getdomaininfo); diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do DOMAIN__DESTROY); } -static int flask_vcpuaffinity(int cmd, struct domain *d) +static int flask_affinity(int cmd, struct domain *d) { u32 perm; switch ( cmd ) { case XEN_DOMCTL_setvcpuaffinity: - perm = DOMAIN__SETVCPUAFFINITY; + case XEN_DOMCTL_setnodeaffinity: + perm = DOMAIN__SETAFFINITY; break; case XEN_DOMCTL_getvcpuaffinity: - perm = DOMAIN__GETVCPUAFFINITY; + case XEN_DOMCTL_getnodeaffinity: + perm = DOMAIN__GETAFFINITY; break; default: return -EPERM; @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops .domain_create = flask_domain_create, .max_vcpus = flask_max_vcpus, .destroydomain = flask_destroydomain, - .vcpuaffinity = flask_vcpuaffinity, + .nodeaffinity = flask_affinity, + .vcpuaffinity = flask_affinity, .scheduler = flask_scheduler, .getdomaininfo = flask_getdomaininfo, .getvcpucontext = flask_getvcpucontext, diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h --- a/xen/xsm/flask/include/av_perm_to_string.h +++ b/xen/xsm/flask/include/av_perm_to_string.h @@ -37,8 +37,8 @@ S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity") S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") diff --git a/xen/xsm/flask/include/av_permissions.h 
b/xen/xsm/flask/include/av_permissions.h --- a/xen/xsm/flask/include/av_permissions.h +++ b/xen/xsm/flask/include/av_permissions.h @@ -38,8 +38,8 @@ #define DOMAIN__TRANSITION 0x00000020UL #define DOMAIN__MAX_VCPUS 0x00000040UL #define DOMAIN__DESTROY 0x00000080UL -#define DOMAIN__SETVCPUAFFINITY 0x00000100UL -#define DOMAIN__GETVCPUAFFINITY 0x00000200UL +#define DOMAIN__SETAFFINITY 0x00000100UL +#define DOMAIN__GETAFFINITY 0x00000200UL #define DOMAIN__SCHEDULER 0x00000400UL #define DOMAIN__GETDOMAININFO 0x00000800UL #define DOMAIN__GETVCPUINFO 0x00001000UL
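The auto/explicit split that this patch introduces in `domain_update_node_affinity()` can be modelled with plain bitmasks: in auto mode the node mask is recomputed from the union of vcpu cpu-affinities, while in manual mode the user-provided mask is only filtered against it. A toy 8-node, 4-cpus-per-node topology — `update_node_affinity()` is an illustrative sketch, not the Xen function:

```c
#include <assert.h>
#include <stdint.h>

#define CPUS_PER_NODE 4

/* Toy stand-in for node_to_cpumask(): node n owns cpus [4n, 4n+3]. */
static uint32_t node_to_cpus(int node)
{
    return 0xFu << (node * CPUS_PER_NODE);
}

/* Mirrors the patched domain_update_node_affinity() logic:
 * - auto mode: set every node whose cpus intersect the vcpu cpumask;
 * - manual mode: keep the user's mask, clearing only nodes with no
 *   intersecting cpus. */
static uint8_t update_node_affinity(uint8_t node_affinity, int auto_mode,
                                    uint32_t vcpu_cpumask)
{
    uint8_t result = auto_mode ? 0 : node_affinity;
    int node;

    for (node = 0; node < 8; node++) {
        if (auto_mode) {
            if (node_to_cpus(node) & vcpu_cpumask)
                result |= 1u << node;
        } else if ((result & (1u << node)) &&
                   !(node_to_cpus(node) & vcpu_cpumask)) {
            result &= ~(unsigned)(1u << node);
        }
    }
    return result;
}
```

With vcpus runnable on cpus 0, 1 and 4 (mask `0x13`), auto mode yields nodes {0,1}, and a manual mask of nodes {0..3} is filtered down to the same set.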
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 5 of 8] libxc: allow for explicitly specifying node-affinity
By providing the proper get/set interface and wiring them to the new domctl-s from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch } +int xc_domain_node_setaffinity(xc_interface *xch, + uint32_t domid, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for setnodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_setnodeaffinity; + domctl.domain = (domid_t)domid; + + memcpy(local, nodemap, nodesize); + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + +int xc_domain_node_getaffinity(xc_interface *xch, + uint32_t domid, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for getnodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_getnodeaffinity; + domctl.domain = (domid_t)domid; + + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + memcpy(nodemap, local, nodesize); + + xc_hypercall_buffer_free(xch, 
local); + + out: + return ret; +} + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch, uint32_t id, uint32_t timeout); +/** + * This function explicitly sets the host NUMA nodes the domain will + * have affinity with. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one wants to set the affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_domain_node_setaffinity(xc_interface *xch, + uint32_t domind, + xc_nodemap_t nodemap); + +/** + * This function retrieves the host NUMA nodes the domain has + * affinity with. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one wants to get the node affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_domain_node_getaffinity(xc_interface *xch, + uint32_t domind, + xc_nodemap_t nodemap); + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
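Both libxc calls above report the map width to Xen as `nodesize * 8` bits, i.e. the node bitmap is byte-packed. A minimal sketch of that packing, assuming (as the code suggests) that `xc_get_nodemap_size()` rounds the node count up to whole bytes — `nodemap_bytes`/`nodemap_set`/`nodemap_test` are illustrative names, not libxc API:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a byte-granular node bitmap (what the nodemap size
 * query presumably boils down to). */
static int nodemap_bytes(int max_nodes)
{
    return (max_nodes + 7) / 8;
}

/* Bit n of the map lives in byte n/8, at bit position n%8. */
static void nodemap_set(uint8_t *map, int node)
{
    map[node / 8] |= (uint8_t)(1u << (node % 8));
}

static int nodemap_test(const uint8_t *map, int node)
{
    return (map[node / 8] >> (node % 8)) & 1;
}
```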
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 6 of 8] libxl: allow for explicitly specifying node-affinity
By introducing a nodemap in libxl_domain_build_info and providing the get/set methods to deal with it. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -3926,6 +3926,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx return rc; } +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap) +{ + if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity"); + return ERROR_FAIL; + } + return 0; +} + +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap) +{ + if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity"); + return ERROR_FAIL; + } + return 0; +} + int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -859,6 +859,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap); +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap); int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap); libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx); diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -219,6 +219,12 @@ int libxl__domain_build_info_setdefault( libxl_defbool_setdefault(&b_info->numa_placement, true); + if (!b_info->nodemap.size) { + if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0)) + return ERROR_FAIL; + 
libxl_bitmap_set_any(&b_info->nodemap); + } + if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) b_info->max_memkb = 32 * 1024; if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT) diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint if (rc) return rc; } + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -255,6 +255,7 @@ libxl_domain_build_info = Struct("domain ("max_vcpus", integer), ("avail_vcpus", libxl_bitmap), ("cpumap", libxl_bitmap), + ("nodemap", libxl_bitmap), ("numa_placement", libxl_defbool), ("tsc_mode", libxl_tsc_mode), ("max_memkb", MemKB),
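Note how this default ties in with the hypervisor side of the series: libxl allocates the nodemap with every bit set (`libxl_bitmap_set_any`), and `domain_set_node_affinity()` takes a full mask as the "reset to automatic" command, while an empty one is rejected. A toy 8-node model of that convention (`toy_dom` and this `set_node_affinity()` are illustrative, not the Xen code):

```c
#include <assert.h>
#include <stdint.h>

struct toy_dom {
    uint8_t node_affinity;
    int auto_node_affinity;
};

static int set_node_affinity(struct toy_dom *d, uint8_t mask)
{
    if (mask == 0)        /* nodes_empty(): affine to nothing is an error */
        return -1;
    if (mask == 0xFF) {   /* nodes_full(): treated as "back to automatic" */
        d->auto_node_affinity = 1;
        return 0;
    }
    d->auto_node_affinity = 0;
    d->node_affinity = mask;
    return 0;
}
```

So a domain built with the default ("any node") nodemap keeps the old auto-computed behaviour unless the user narrows the map.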
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 7 of 8] libxl: automatic placement deals with node-affinity
Which basically means the following two things: 1) during domain creation, it is the node-affinity of the domain --rather than the vcpu-affinities of its vcpus-- that is affected by automatic placement; 2) during automatic placement, when counting how many vcpus are already "bound" to a placement candidate (as part of the process of choosing the best candidate), node-affinity is also considered, together with vcpu-affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc * { int found; libxl__numa_candidate candidate; - libxl_bitmap candidate_nodemap; + libxl_bitmap cpupool_nodemap; libxl_cpupoolinfo cpupool_info; int i, cpupool, rc = 0; uint32_t memkb; libxl__numa_candidate_init(&candidate); - libxl_bitmap_init(&candidate_nodemap); + libxl_bitmap_init(&cpupool_nodemap); /* * Extract the cpumap from the cpupool the domain belong to. In fact, @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc * rc = libxl_domain_need_memory(CTX, info, &memkb); if (rc) goto out; - if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { + if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) { rc = ERROR_FAIL; goto out; } @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc * if (found == 0) goto out; - /* Map the candidate''s node map to the domain''s info->cpumap */ - libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap); - rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + /* Map the candidate''s node map to the domain''s info->nodemap */ + libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap); + + /* Avoid trying to set the affinity to nodes that might be in the + * candidate''s nodemap but out of our cpupool. 
*/ + rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap, + &cpupool_nodemap); if (rc) goto out; - /* Avoid trying to set the affinity to cpus that might be in the - * nodemap but not in our cpupool. */ - libxl_for_each_set_bit(i, info->cpumap) { - if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) - libxl_bitmap_reset(&info->cpumap, i); + libxl_for_each_set_bit(i, info->nodemap) { + if (!libxl_bitmap_test(&cpupool_nodemap, i)) + libxl_bitmap_reset(&info->nodemap, i); } LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc * out: libxl__numa_candidate_dispose(&candidate); - libxl_bitmap_dispose(&candidate_nodemap); + libxl_bitmap_dispose(&cpupool_nodemap); libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint /* * Check if the domain has any CPU affinity. If not, try to build * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->cpumap accordingly; if it + * candidate, it will affect info->nodemap accordingly; if it * does not, it just leaves it as it is. This means (unless * some weird error manifests) the subsequent call to - * libxl_set_vcpuaffinity_all() will do the actual placement, + * libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. 
*/ if (libxl_defbool_val(info->numa_placement)) { diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -171,7 +171,7 @@ static int nodemap_to_nr_vcpus(libxl__gc const libxl_bitmap *nodemap) { libxl_dominfo *dinfo = NULL; - libxl_bitmap vcpu_nodemap; + libxl_bitmap vcpu_nodemap, dom_nodemap; int nr_doms, nr_cpus; int nr_vcpus = 0; int i, j, k; @@ -185,6 +185,12 @@ static int nodemap_to_nr_vcpus(libxl__gc return ERROR_FAIL; } + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { + libxl_dominfo_list_free(dinfo, nr_doms); + libxl_bitmap_dispose(&vcpu_nodemap); + return ERROR_FAIL; + } + for (i = 0; i < nr_doms; i++) { libxl_vcpuinfo *vinfo; int nr_dom_vcpus; @@ -193,6 +199,9 @@ static int nodemap_to_nr_vcpus(libxl__gc if (vinfo == NULL) continue; + /* Retrieve the domain''s node-affinity map (see below) */ + libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); + /* For each vcpu of each domain ... */ for (j = 0; j < nr_dom_vcpus; j++) { @@ -201,9 +210,17 @@ static int nodemap_to_nr_vcpus(libxl__gc libxl_for_each_set_bit(k, vinfo[j].cpumap) libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node); - /* And check if that map has any intersection with our nodemap */ + /* + * We now check whether the && of the vcpu''s nodemap and the + * domain''s nodemap has any intersection with the nodemap of our + * canidate. + * Using both (vcpu''s and domain''s) nodemaps allows us to take + * both vcpu-affinity and node-affinity into account when counting + * the number of vcpus bound to the candidate. 
+ */ libxl_for_each_set_bit(k, vcpu_nodemap) { - if (libxl_bitmap_test(nodemap, k)) { + if (libxl_bitmap_test(&dom_nodemap, k) && + libxl_bitmap_test(nodemap, k)) { nr_vcpus++; break; } @@ -213,6 +230,7 @@ static int nodemap_to_nr_vcpus(libxl__gc libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); } + libxl_bitmap_dispose(&dom_nodemap); libxl_bitmap_dispose(&vcpu_nodemap); libxl_dominfo_list_free(dinfo, nr_doms); return nr_vcpus;
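The counting rule this hunk implements — a vcpu is "bound" to a candidate only if the intersection of its own nodemap, its domain's node-affinity and the candidate's nodemap is non-empty — reduces to a triple AND per vcpu. An illustrative bitmask sketch, not the libxl implementation:

```c
#include <assert.h>
#include <stdint.h>

/* For each vcpu, test (vcpu's nodemap & domain's node-affinity &
 * candidate's nodemap) != 0, so both vcpu-affinity and node-affinity
 * weigh in when ranking placement candidates. */
static int count_bound_vcpus(const uint8_t *vcpu_nodemaps, int nr_vcpus,
                             uint8_t dom_nodemap, uint8_t candidate)
{
    int j, n = 0;

    for (j = 0; j < nr_vcpus; j++)
        if (vcpu_nodemaps[j] & dom_nodemap & candidate)
            n++;
    return n;
}
```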
Dario Faggioli
2012-Oct-05 14:08 UTC
[PATCH 8 of 8] xl: add node-affinity to the output of `xl list`
Node-affinity is now something that is under (some) control of the user, so show it upon request as part of the output of `xl list''. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -2834,14 +2834,82 @@ out: } } -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain) +static void print_bitmap(uint8_t *map, int maplen, FILE *stream, int cpu_node) +{ + int i; + uint8_t pmap = 0, bitmask = 0; + int firstset = 0, state = 0; + + for (i = 0; i < maplen; i++) { + if (i % 8 == 0) { + pmap = *map++; + bitmask = 1; + } else bitmask <<= 1; + + switch (state) { + case 0: + case 2: + if ((pmap & bitmask) != 0) { + firstset = i; + state++; + } + continue; + case 1: + case 3: + if ((pmap & bitmask) == 0) { + fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); + if (i - 1 > firstset) + fprintf(stream, "-%d", i - 1); + state = 2; + } + continue; + } + } + switch (state) { + case 0: + fprintf(stream, "none"); + break; + case 2: + break; + case 1: + if (firstset == 0) { + fprintf(stream, cpu_node ? "any cpu" : "any node"); + break; + } + case 3: + fprintf(stream, "%s%d", state > 1 ? 
"," : "", firstset); + if (i - 1 > firstset) + fprintf(stream, "-%d", i - 1); + break; + } +} + +static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain) { int i; static const char shutdown_reason_letters[]= "-rscw"; + libxl_bitmap nodemap; + libxl_physinfo physinfo; + + libxl_bitmap_init(&nodemap); + libxl_physinfo_init(&physinfo); printf("Name ID Mem VCPUs\tState\tTime(s)"); if (verbose) printf(" UUID Reason-Code\tSecurity Label"); if (context && !verbose) printf(" Security Label"); + if (numa) { + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) { + fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n"); + exit(1); + } + if (libxl_get_physinfo(ctx, &physinfo) != 0) { + fprintf(stderr, "libxl_physinfo failed.\n"); + libxl_bitmap_dispose(&nodemap); + exit(1); + } + + printf(" NODE Affinity"); + } printf("\n"); for (i = 0; i < nb_domain; i++) { char *domname; @@ -2875,14 +2943,23 @@ static void list_domains(int verbose, in rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf, &size); if (rc < 0) - printf(" -"); + printf(" -"); else { - printf(" %s", buf); + printf(" %16s", buf); free(buf); } } + if (numa) { + libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap); + + putchar('' ''); + print_bitmap(nodemap.map, physinfo.nr_nodes, stdout, 0); + } putchar(''\n''); } + + libxl_bitmap_dispose(&nodemap); + libxl_physinfo_dispose(&physinfo); } static void list_vm(void) @@ -3724,12 +3801,14 @@ int main_list(int argc, char **argv) int opt, verbose = 0; int context = 0; int details = 0; + int numa = 0; int option_index = 0; static struct option long_options[] = { {"long", 0, 0, ''l''}, {"help", 0, 0, ''h''}, {"verbose", 0, 0, ''v''}, {"context", 0, 0, ''Z''}, + {"numa", 0, 0, ''n''}, {0, 0, 0, 0} }; @@ -3738,7 +3817,7 @@ int main_list(int argc, char **argv) int nb_domain, rc; while (1) { - opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index); + opt = getopt_long(argc, argv, "lvhZn", long_options, 
&option_index); if (opt == -1) break; @@ -3755,6 +3834,9 @@ int main_list(int argc, char **argv) case ''Z'': context = 1; break; + case ''n'': + numa = 1; + break; default: fprintf(stderr, "option `%c'' not supported.\n", optopt); break; @@ -3790,7 +3872,7 @@ int main_list(int argc, char **argv) if (details) list_domains_details(info, nb_domain); else - list_domains(verbose, context, info, nb_domain); + list_domains(verbose, context, numa, info, nb_domain); if (info_free) libxl_dominfo_list_free(info, nb_domain); @@ -4062,56 +4144,6 @@ int main_button_press(int argc, char **a return 0; } -static void print_bitmap(uint8_t *map, int maplen, FILE *stream) -{ - int i; - uint8_t pmap = 0, bitmask = 0; - int firstset = 0, state = 0; - - for (i = 0; i < maplen; i++) { - if (i % 8 == 0) { - pmap = *map++; - bitmask = 1; - } else bitmask <<= 1; - - switch (state) { - case 0: - case 2: - if ((pmap & bitmask) != 0) { - firstset = i; - state++; - } - continue; - case 1: - case 3: - if ((pmap & bitmask) == 0) { - fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); - if (i - 1 > firstset) - fprintf(stream, "-%d", i - 1); - state = 2; - } - continue; - } - } - switch (state) { - case 0: - fprintf(stream, "none"); - break; - case 2: - break; - case 1: - if (firstset == 0) { - fprintf(stream, "any cpu"); - break; - } - case 3: - fprintf(stream, "%s%d", state > 1 ? 
"," : "", firstset); - if (i - 1 > firstset) - fprintf(stream, "-%d", i - 1); - break; - } -} - static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, uint32_t nr_cpus) @@ -4135,7 +4167,7 @@ static void print_vcpuinfo(uint32_t tdom /* TIM */ printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); /* CPU AFFINITY */ - print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); + print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout, 1); printf("\n"); } diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = { "[options] [Domain]\n", "-l, --long Output all VM details\n" "-v, --verbose Prints out UUIDs and security context\n" - "-Z, --context Prints out security context" + "-Z, --context Prints out security context\n" + "-n, --numa Prints out NUMA node affinity" }, { "destroy", &main_destroy, 0, 1,
Jan Beulich
2012-Oct-05 14:25 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
>>> On 05.10.12 at 16:08, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> @@ -287,22 +344,26 @@ static inline void
>      }
>      else
>      {
> -        cpumask_t idle_mask;
> +        cpumask_t idle_mask, balance_mask;

Be _very_ careful about adding on-stack CPU mask variables (also further
below): each one of them grows the stack frame by 512 bytes (when building
for the current maximum of 4095 CPUs), which is generally too much; you
may want to consider pre-allocated scratch space instead.

Jan
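Jan's 512-byte figure comes straight from the mask sizing: a bitmap wide enough for 4095 CPUs needs 64 64-bit words. A sketch of the arithmetic — these macros model, rather than reuse, Xen's `BITS_TO_LONGS`, and assume 64-bit longs:

```c
#include <assert.h>

#define NR_CPUS       4095   /* current Xen build-time maximum Jan cites */
#define BITS_PER_LONG 64

/* Round a bit count up to whole longs, as the cpumask allocation does. */
#define BITS_TO_LONGS(bits) \
    (((bits) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Bytes each on-stack cpumask_t-like variable would cost. */
enum { CPUMASK_BYTES = BITS_TO_LONGS(NR_CPUS) * (BITS_PER_LONG / 8) };
```

Hence every extra on-stack mask adds half a kilobyte of stack, which is why pre-allocated (e.g. per-CPU) scratch space is the usual alternative.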
Ian Jackson
2012-Oct-05 16:36 UTC
Re: [PATCH 8 of 8] xl: add node-affinity to the output of `xl list`
Dario Faggioli writes ("[PATCH 8 of 8] xl: add node-affinity to the output of `xl list`"):
> Node-affinity is now something that is under (some) control of the
> user, so show it upon request as part of the output of `xl list'.
...
> -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain)
> +static void print_bitmap(uint8_t *map, int maplen, FILE *stream, int cpu_node)
> +{
> +    int i;
> +    uint8_t pmap = 0, bitmask = 0;
> +    int firstset = 0, state = 0;
> +
> +    for (i = 0; i < maplen; i++) {
> +        if (i % 8 == 0) {
> +            pmap = *map++;
> +            bitmask = 1;
> +        } else bitmask <<= 1;
> +
> +        switch (state) {
> +        case 0:
> +        case 2:
> +            if ((pmap & bitmask) != 0) {
> +                firstset = i;
> +                state++;
> +            }
> +            continue;
> +        case 1:
> +        case 3:
> +            if ((pmap & bitmask) == 0) {
> +                fprintf(stream, "%s%d", state > 1 ? "," : "", firstset);
> +                if (i - 1 > firstset)
> +                    fprintf(stream, "-%d", i - 1);
> +                state = 2;
> +            }
> +            continue;
> +        }
> +    }

Is this business with a state variable really the least opaque way of
writing this? Oh, I see you're just moving it about. Oh well..

Ian.
Dan Magenheimer
2012-Oct-08 19:43 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
> From: Dario Faggioli [mailto:dario.faggioli@citrix.com]
> Sent: Friday, October 05, 2012 8:08 AM
> To: xen-devel@lists.xen.org
> Cc: Andre Przywara; Ian Campbell; Anil Madhavapeddy; George Dunlap;
> Andrew Cooper; Juergen Gross; Ian Jackson; Jan Beulich; Marcus Granado;
> Daniel De Graaf; Matt Wilson
> Subject: [Xen-devel] [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
>
> Hi Everyone,
>
> Here it comes a patch series instilling some NUMA awareness in the Credit
> scheduler.

Hi Dario --

Just wondering... is the NUMA information preserved on live migration?
I'm not saying that it necessarily should, but it may just work due to
the implementation (since migration is a form of domain creation).

In either case, it might be good to comment about live migration on your
wiki.

Thanks,
Dan
Juergen Gross
2012-Oct-09 09:53 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
On 05.10.2012 16:08, Dario Faggioli wrote:
> As vcpu-affinity tells where vcpus can run, node-affinity tells
> where a domain's vcpus prefer to run. Respecting vcpu-affinity is
> the primary concern, but honouring node-affinity will likely
> result in some performance benefit.
>
> This change modifies the vcpu load balancing algorithm (for the
> credit scheduler only), introducing a two-step logic.
> During the first step, we use the node-affinity mask. The aim is
> giving precedence to the CPUs where it is known to be preferable
> for the domain to run. If that fails to find a valid CPU, the
> node-affinity is just ignored and, in the second step, we fall
> back to using cpu-affinity only.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
> --- a/xen/common/sched_credit.c
> +++ b/xen/common/sched_credit.c
...
>  static int
>  _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
>  {
> -    cpumask_t cpus;
> +    cpumask_t cpus, start_cpus;
>      cpumask_t idlers;
>      cpumask_t *online;
> +    struct csched_dom *sdom = CSCHED_DOM(vc->domain);
>      struct csched_pcpu *spc = NULL;
>      int cpu;
>
>      /*
> -     * Pick from online CPUs in VCPU's affinity mask, giving a
> -     * preference to its current processor if it's in there.
> +     * Pick an online CPU from the && of vcpu-affinity and node-affinity
> +     * masks (if not empty, in which case only the vcpu-affinity mask is
> +     * used). Also, try to give a preference to its current processor if
> +     * it's in there.
>       */
>      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
>      cpumask_and(&cpus, online, vc->cpu_affinity);
> -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> +    cpumask_and(&start_cpus, &cpus, sdom->node_affinity_cpumask);
> +    if ( unlikely(cpumask_empty(&start_cpus)) )
> +        cpumask_copy(&start_cpus, &cpus);
> +    cpu = cpumask_test_cpu(vc->processor, &start_cpus)
>          ? vc->processor
> -        : cpumask_cycle(vc->processor, &cpus);
> +        : cpumask_cycle(vc->processor, &start_cpus);
>      ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );

Shouldn't the ASSERT be changed to start_cpus, too?

Juergen

--
Juergen Gross                    Principal Developer Operating Systems
PBG PDG ES&S SWE OS6                 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions         e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                        Internet: ts.fujitsu.com
D-80807 Muenchen                     Company details: ts.fujitsu.com/imprint.html
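The two-step mask narrowing quoted above can be sketched with plain 64-bit masks. This is a simplified stand-in for Xen's cpumask_t: pick_cpu() and its first-set-bit fallback are illustrative, not the scheduler's actual API.

```c
#include <stdint.h>

/* Simplified model: each bit of a uint64_t stands for one pCPU. */
static int pick_cpu(uint64_t online, uint64_t cpu_affinity,
                    uint64_t node_affinity, int current_cpu)
{
    uint64_t cpus = online & cpu_affinity;   /* hard constraint           */
    uint64_t start = cpus & node_affinity;   /* preferred (soft) subset   */

    /* Empty intersection: ignore node-affinity, use vcpu-affinity only. */
    if (start == 0)
        start = cpus;

    /* Prefer the current processor if it is in the narrowed mask. */
    if (start & (1ULL << current_cpu))
        return current_cpu;

    /* Otherwise take the first set bit (stand-in for cpumask_cycle()). */
    for (int c = 0; c < 64; c++)
        if (start & (1ULL << c))
            return c;
    return -1;  /* cannot happen in the real scheduler: cpus is non-empty */
}
```

Note that Juergen's point holds in this model too: the final sanity check should test membership in the narrowed mask (`start`), not just the vcpu-affinity mask.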
Juergen Gross
2012-Oct-09 10:02 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On 05.10.2012 16:08, Dario Faggioli wrote:
> Hi Everyone,
>
> Here it comes a patch series instilling some NUMA awareness in the Credit
> scheduler.
> [...]
> A little bit more about the series:
>
>  1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>  2/8 xen, libxc: introduce node maps and masks
>
> Is some preparation work.
>
>  3/8 xen: let the (credit) scheduler know about `node affinity`
>
> Is where the vcpu load balancing logic of the credit scheduler is modified to
> support node-affinity.
>
>  4/8 xen: allow for explicitly specifying node-affinity
>  5/8 libxc: allow for explicitly specifying node-affinity
>  6/8 libxl: allow for explicitly specifying node-affinity
>  7/8 libxl: automatic placement deals with node-affinity
>
> Is what wires the in-scheduler node-affinity support with the external world.
> Please note that patch 4 touches XSM and Flask, which is the area with which I
> have less experience and less chance to test properly. So, if Daniel and/or
> anyone interested in that could take a look and comment, that would be awesome.
>
>  8/8 xl: report node-affinity for domains
>
> Is just some small output enhancement.

Apart from the minor comment to Patch 3:

Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
Dario Faggioli
2012-Oct-09 10:21 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
On Tue, 2012-10-09 at 11:53 +0200, Juergen Gross wrote:
> [...]
> >      online = cpupool_scheduler_cpumask(vc->domain->cpupool);
> >      cpumask_and(&cpus, online, vc->cpu_affinity);
> > -    cpu = cpumask_test_cpu(vc->processor, &cpus)
> > +    cpumask_and(&start_cpus, &cpus, sdom->node_affinity_cpumask);
> > +    if ( unlikely(cpumask_empty(&start_cpus)) )
> > +        cpumask_copy(&start_cpus, &cpus);
> > +    cpu = cpumask_test_cpu(vc->processor, &start_cpus)
> >          ? vc->processor
> > -        : cpumask_cycle(vc->processor, &cpus);
> > +        : cpumask_cycle(vc->processor, &start_cpus);
> >      ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) );
>
> Shouldn't the ASSERT be changed to start_cpus, too?

Well, it seems it definitely should, and I seem to have missed that!

Thanks a lot,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2012-Oct-09 10:29 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
On Fri, 2012-10-05 at 15:25 +0100, Jan Beulich wrote:
> >>> On 05.10.12 at 16:08, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> > @@ -287,22 +344,26 @@ static inline void
> >      }
> >      else
> >      {
> > -        cpumask_t idle_mask;
> > +        cpumask_t idle_mask, balance_mask;
>
> Be _very_ careful about adding on-stack CPU mask variables
> (also further below): each one of them grows the stack frame
> by 512 bytes (when building for the current maximum of 4095
> CPUs), which is generally too much; you may want to consider
> pre-allocated scratch space instead.

I see your point, and I think you're right... I wasn't "thinking that
big". :-)

I'll look into all of these situations and see if I can move the masks
off the stack. Any preference between global variables and members of
one of the scheduler's data structures?

Thanks and Regards,
Dario
Dario Faggioli
2012-Oct-09 10:45 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Mon, 2012-10-08 at 12:43 -0700, Dan Magenheimer wrote:
> Just wondering... is the NUMA information preserved on live migration?
> I'm not saying that it necessarily should, but it may just work
> due to the implementation (since migration is a form of domain creation).

What could I say... yes, but "preserved" is not the right word. :-)

In fact, something happens when you migrate a VM. As you said, migration
is a special case of domain creation, so the placement algorithm will
trigger as part of the process of creating the target VM (unless you
override the relevant options in the config file during migration
itself). That means the target VM will be placed on one (or some) node(s)
of the target host, and its node-affinity will be set accordingly.

_However_, there is right now no guarantee that the final decision of the
placement algorithm on the target machine will be "compatible" with the
one made on the source machine at initial VM creation time. For instance,
if your VM fits in just one node and is placed there on machine A, it
could well end up being split across two or more nodes when migrated to
machine B (and, of course, vice versa).

Whether that is acceptable or not is of course debatable, and we had a
bit of this discussion already (although no real conclusion has been
reached yet). My take is that, right now, since we do not yet expose any
virtual NUMA topology to the VM itself, the behaviour described above is
fine. As soon as we have some guest NUMA awareness, it might be
worthwhile to try to preserve it, at least to some extent.

Oh, and BTW, I'm of course talking about migration with xl and libxl. If
you use other toolstacks, the hypervisor will default to its current
(_without_ this series) behaviour, and it all will depend on who calls
xc_domain_node_setaffinity() — or, perhaps, XEN_DOMCTL_setnodeaffinity
directly — and when.

> In either case, it might be good to comment about live migration
> on your wiki.

That is definitely a good point; I will put something there about
migration and the behaviour described above.

Thanks and Regards,
Dario
Dario Faggioli
2012-Oct-09 11:07 UTC
Re: [PATCH 8 of 8] xl: add node-affinity to the output of `xl list`
On Fri, 2012-10-05 at 17:36 +0100, Ian Jackson wrote:
> > +static void print_bitmap(uint8_t *map, int maplen, FILE *stream, int cpu_node)
> > +{
> > +    int i;
> > +    uint8_t pmap = 0, bitmask = 0;
> > +    int firstset = 0, state = 0;
> > +
> ...
>
> Is this business with a state variable really the least opaque way of
> writing this ?  Oh I see you're just moving it about.  Oh well..

I don't think it's particularly opaque and, yes, I'm mostly moving that
print_bitmap function up in the file, but the new state variable is
mine (guilty as charged :-D).

Honestly, despite the fact that the function is called print_bitmap(),
it contains the following code:

    case 1:
        if (firstset == 0) {
            fprintf(stream, "any cpu");
            break;
        }
    case 3:

Which is what made me think that opacity was not its first concern in
the first place, and that making it less opaque was none of this
change's business. :-)

However, I see your point... Perhaps I can add two functions (something
like print_{cpumap,nodemap}()), both calling the original
print_bitmap(), and deal with the "any {cpu,node}" case within them...

Do you like that better?

Thanks and Regards,
Dario
Keir Fraser
2012-Oct-09 11:10 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
On 09/10/2012 11:29, "Dario Faggioli" <dario.faggioli@citrix.com> wrote:
> On Fri, 2012-10-05 at 15:25 +0100, Jan Beulich wrote:
> > Be _very_ careful about adding on-stack CPU mask variables
> > (also further below): each one of them grows the stack frame
> > by 512 bytes (when building for the current maximum of 4095
> > CPUs), which is generally too much; you may want to consider
> > pre-allocated scratch space instead.
>
> I see your point, and I think you're right... I wasn't "thinking that
> big". :-)
>
> I'll look into all of these situations and see if I can move the masks
> off the stack. Any preference between global variables and members of
> one of the scheduler's data structures?

Since multiple instances of the scheduler can be active, across multiple
cpu pools, surely they have to be allocated in the per-scheduler-instance
structures? Or dynamically xmalloc'ed just in the scope they are needed.

 -- Keir
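Keir's suggestion — per-scheduler-instance scratch space instead of large on-stack masks — can be sketched as follows. The struct and field names here are made up for illustration; real Xen code would also need to consider locking (or per-CPU scratch areas), since several CPUs can run the balancing code concurrently.

```c
#include <stdlib.h>

/* At NR_CPUS = 4095, a cpumask is 512 bytes: too large for the stack. */
#define MASK_BYTES 512

/* Hypothetical per-scheduler-instance private data: one scheduler
 * instance exists per cpupool, so the scratch mask lives with it. */
struct sched_private {
    unsigned char *scratch_mask;  /* pre-allocated, reused every balance pass */
};

static int sched_init(struct sched_private *prv)
{
    prv->scratch_mask = calloc(1, MASK_BYTES);
    return prv->scratch_mask ? 0 : -1;
}

/* Illustrative balancing step: intersect idlers with an affinity mask
 * without declaring a 512-byte local variable. */
static void balance_step(struct sched_private *prv,
                         const unsigned char *idlers,
                         const unsigned char *affinity)
{
    for (int i = 0; i < MASK_BYTES; i++)
        prv->scratch_mask[i] = idlers[i] & affinity[i];
}

static void sched_deinit(struct sched_private *prv)
{
    free(prv->scratch_mask);
    prv->scratch_mask = NULL;
}
```

The alternative Keir mentions — xmalloc'ing the mask in the scope where it is needed — trades the per-instance memory for an allocation on every balancing pass.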
Ian Jackson
2012-Oct-09 15:03 UTC
Re: [PATCH 8 of 8] xl: add node-affinity to the output of `xl list`
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 8 of 8] xl: add node-affinity to the output of `xl list`"):
> Honestly, despite the fact that the function is called print_bitmap(),
> it contains the following code:
>
>     case 1:
>         if (firstset == 0) {
>             fprintf(stream, "any cpu");
>             break;
>         }
>     case 3:

Uh, yes, I see what you mean.

> Which is what made me think that opacity was not its first concern in
> the first place, and that making it less opaque was none of this
> change's business. :-)

You are right that since you're just moving the code, it's not a
problem for this patch.

> However, I see your point... Perhaps I can add two functions (something
> like print_{cpumap,nodemap}()), both calling the original
> print_bitmap(), and deal with the "any {cpu,node}" case within them...
>
> Do you like that better?

That would indeed be an improvement.

Ian.
George Dunlap
2012-Oct-09 15:59 UTC
Re: [PATCH 1 of 8] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> More specifically:
>  1. replaces xenctl_cpumap with xenctl_bitmap
>  2. provides bitmap_to_xenctl_bitmap and the reverse;
>  3. re-implement cpumask_to_xenctl_bitmap with
>     bitmap_to_xenctl_bitmap and the reverse;
>
> Other than #3, no functional changes. Interface only slightly
> affected.
>
> This is in preparation of introducing NUMA node-affinity maps.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

> [...]
> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -820,9 +820,9 @@ typedef uint8_t xen_domain_handle_t[16];
>  #endif
>
>  #ifndef __ASSEMBLY__
> -struct xenctl_cpumap {
> +struct xenctl_bitmap {
>      XEN_GUEST_HANDLE_64(uint8) bitmap;
> -    uint32_t nr_cpus;
> +    uint32_t nr_elems;
>  };
>  #endif
>
> diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
> --- a/xen/include/xen/cpumask.h
> +++ b/xen/include/xen/cpumask.h
> @@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
>  #define for_each_present_cpu(cpu)  for_each_cpu(cpu, &cpu_present_map)
>
>  /* Copy to/from cpumap provided by control tools. */
> -struct xenctl_cpumap;
> -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
> -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *);
> +struct xenctl_bitmap;
> +int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *);
> +int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *);
>
>  #endif /* __XEN_CPUMASK_H */
>
> [rest of the quoted patch — the mechanical nr_cpus -> nr_elems and
> xenctl_cpumap -> xenctl_bitmap renames across libxc and xen/common —
> trimmed]
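The conversion helpers in this patch compute the guest-side byte count as `(nr_elems + 7) / 8` and clear the unused high bits of the last byte when the element count is not a multiple of 8. That arithmetic can be isolated in a small standalone sketch (these are free-standing helpers written for illustration, not Xen's actual functions):

```c
#include <stdint.h>

/* Bytes needed to hold nr_elems bits, rounding up. */
static unsigned int bytes_for(unsigned int nr_elems)
{
    return (nr_elems + 7) / 8;
}

/* If nr_elems is not a multiple of 8, the last byte of the bytemap has
 * (8 - nr_elems % 8) bits that do not correspond to any element; clear
 * them so stale guest data cannot leak into the kernel-side bitmap. */
static void clamp_last_byte(uint8_t *bytemap, unsigned int nr_elems)
{
    if (nr_elems & 7)
        bytemap[bytes_for(nr_elems) - 1] &= ~(0xff << (nr_elems & 7));
}
```

For example, a 10-element bitmap occupies 2 bytes, and only the low 2 bits of the second byte are meaningful.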
George Dunlap
2012-Oct-09 15:59 UTC
Re: [PATCH 2 of 8] xen, libxc: introduce node maps and masks
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> Following suit from cpumap and cpumask implementations. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c > --- a/tools/libxc/xc_misc.c > +++ b/tools/libxc/xc_misc.c > @@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch > return (xc_get_max_cpus(xch) + 7) / 8; > } > > +int xc_get_nodemap_size(xc_interface *xch) > +{ > + return (xc_get_max_nodes(xch) + 7) / 8; > +} > + > xc_cpumap_t xc_cpumap_alloc(xc_interface *xch) > { > int sz; > @@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface > return calloc(1, sz); > } > > +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch) > +{ > + int sz; > + > + sz = xc_get_nodemap_size(xch); > + if (sz == 0) > + return NULL; > + return calloc(1, sz); > +} > + > int xc_readconsolering(xc_interface *xch, > char *buffer, > unsigned int *pnr_chars, > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h > --- a/tools/libxc/xenctrl.h > +++ b/tools/libxc/xenctrl.h > @@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch > /* allocate a cpumap */ > xc_cpumap_t xc_cpumap_alloc(xc_interface *xch); > > - /* > +/* > * NODEMAP handling > */ > +typedef uint8_t *xc_nodemap_t; > + > /* return maximum number of NUMA nodes the hypervisor supports */ > int xc_get_max_nodes(xc_interface *xch); > > +/* return array size for nodemap */ > +int xc_get_nodemap_size(xc_interface *xch); > + > +/* allocate a nodemap */ > +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch); > + > /* > * DOMAIN DEBUGGING FUNCTIONS > */ > diff --git a/xen/common/domctl.c b/xen/common/domctl.c > --- a/xen/common/domctl.c > +++ b/xen/common/domctl.c > @@ -118,6 +118,30 @@ int xenctl_bitmap_to_cpumask(cpumask_var > return err; > } > > +int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap, > + const nodemask_t *nodemask) > +{ > + return 
bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask), > + MAX_NUMNODES); > +} > + > +int xenctl_bitmap_to_nodemask(nodemask_t *nodemask, > + const struct xenctl_bitmap *xenctl_nodemap) > +{ > + int err = 0; > + > + if ( alloc_nodemask_var(nodemask) ) { > + err = xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap, > + MAX_NUMNODES); > + if ( err ) > + free_nodemask_var(*nodemask); > + } > + else > + err = -ENOMEM; > + > + return err; > +} > + > static inline int is_free_domid(domid_t dom) > { > struct domain *d; > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h > --- a/xen/include/xen/nodemask.h > +++ b/xen/include/xen/nodemask.h > @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const > } > #endif > > +/* > + * nodemask_var_t: struct nodemask for stack usage. > + * > + * See definition of cpumask_var_t in include/xen//cpumask.h. > + */ > +#if MAX_NUMNODES > 2 * BITS_PER_LONG > +#include <xen/xmalloc.h> > + > +typedef nodemask_t *nodemask_var_t; > + > +#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG) > + > +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) > +{ > + *(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long)); > + return *mask != NULL; > +} > + > +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) > +{ > + *(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long)); > + return *mask != NULL; > +} > + > +static inline void free_nodemask_var(nodemask_var_t mask) > +{ > + xfree(mask); > +} > +#else > +typedef nodemask_t nodemask_var_t; > + > +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) > +{ > + return 1; > +} > + > +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) > +{ > + nodes_clear(*mask); > + return 1; > +} > + > +static inline void free_nodemask_var(nodemask_var_t mask) > +{ > +} > +#endif > + > #if MAX_NUMNODES > 1 > #define for_each_node_mask(node, mask) \ > for ((node) = first_node(mask); \ > > 
_______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
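The `(xc_get_max_cpus(xch) + 7) / 8` and `(xc_get_max_nodes(xch) + 7) / 8` expressions in the hunk above are the usual bits-to-bytes rounding: a bitmap of N bits needs ceil(N/8) bytes. A minimal standalone sketch of that arithmetic (the helper name is illustrative, not the libxc function itself):

```c
#include <assert.h>

/* Bytes needed to hold a bitmap of nbits bits, rounded up to whole
 * bytes -- the same computation as xc_get_cpumap_size() and the new
 * xc_get_nodemap_size() in the patch above. */
static int bitmap_size_bytes(int nbits)
{
    return (nbits + 7) / 8;
}
```

Note that a result of 0 (when the hypervisor reports no cpus/nodes, e.g. on error) is what makes the `sz == 0` check in `xc_nodemap_alloc()` return NULL instead of calling `calloc(1, 0)`.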
George Dunlap
2012-Oct-09 16:29 UTC
Re: [PATCH 3 of 8] xen: let the (credit) scheduler know about `node affinity`
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> As vcpu-affinity tells where vcpus can run, node-affinity tells > where a domain''s vcpus prefer to run. Respecting vcpu-affinity is > the primary concern, but honouring node-affinity will likely > result in some performances benefit. > > This change modifies the vcpu load balancing algorithm (for the > credit scheduler only), introducing a two steps logic. > During the first step, we use the node-affinity mask. The aim is > giving precedence to the CPUs where it is known to be preferrable > for the domain to run. If that fails in finding a valid CPU, the > node-affinity is just ignored and, in the second step, we fall > back to using cpu-affinity only. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>Looking at the load-balancing code, it makes me think that there is probably some interesting work to do there in the future; but I think this patch can go in as it is for now. So (Once others'' comments are addressed) Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Do scroll down to read my comments on load balancing...> @@ -1211,30 +1289,48 @@ csched_runq_steal(int peer_cpu, int cpu, > */ > if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) ) > { > - list_for_each( iter, &peer_pcpu->runq ) > + int balance_step; > + > + /* > + * Take node-affinity into account. That means, for all the vcpus > + * in peer_pcpu''s runq, check _first_ if their node-affinity allows > + * them to run on cpu. If not, retry the loop considering plain > + * vcpu-affinity. Also, notice that as soon as one vcpu is found, > + * balancing is considered done, and the vcpu is returned to the > + * caller. > + */ > + for_each_csched_balance_step(balance_step) > { > - speer = __runq_elem(iter); > + list_for_each( iter, &peer_pcpu->runq ) > + { > + cpumask_t balance_mask; > > - /* > - * If next available VCPU here is not of strictly higher > - * priority than ours, this PCPU is useless to us. 
> - */ > - if ( speer->pri <= pri ) > - break; > + speer = __runq_elem(iter); > > - /* Is this VCPU is runnable on our PCPU? */ > - vc = speer->vcpu; > - BUG_ON( is_idle_vcpu(vc) ); > + /* > + * If next available VCPU here is not of strictly higher > + * priority than ours, this PCPU is useless to us. > + */ > + if ( speer->pri <= pri ) > + break; > > - if (__csched_vcpu_is_migrateable(vc, cpu)) > - { > - /* We got a candidate. Grab it! */ > - CSCHED_VCPU_STAT_CRANK(speer, migrate_q); > - CSCHED_STAT_CRANK(migrate_queued); > - WARN_ON(vc->is_urgent); > - __runq_remove(speer); > - vc->processor = cpu; > - return speer; > + /* Is this VCPU runnable on our PCPU? */ > + vc = speer->vcpu; > + BUG_ON( is_idle_vcpu(vc) ); > + > + if ( csched_balance_cpumask(vc, balance_step, &balance_mask) ) > + continue;This will have the effect that a vcpu with any node affinity at all will be stolen before a vcpu with no node affinity: i.e., if you have a system with 4 nodes, and one vcpu has an affinity to nodes 1-2-3, another has affinity with only 1, and another has an affinity to all 4, the one which has an affinity to all 4 will be passed over the first round, while either of the first ones might be nabbed (depending on what pcpu they''re on). Furthermore, the effect of this whole thing (if I''m reading it right) will be to go through *each runqueue* twice, rather than checking all cpus for vcpus with good node affinity, and then all cpus for vcpus with good cpumasks. It seems like it would be better to check: * This node for node-affine work to steal * Other nodes for node-affine work to steal * All nodes for cpu-affine work to steal. Ideally, the search would terminate fairly quickly with the first set, meaning that in the common case we never even check other nodes. Going through the cpu list twice means trying to grab the scheduler lock for each cpu twice; but hopefully that would be made up for by having a shorter list. Thoughts? 
Like I said, I think this is something to put on our to-do list; this patch should go in so we can start testing it as soon as possible. -George> + > + if (__csched_vcpu_is_migrateable(vc, cpu, &balance_mask)) > + { > + /* We got a candidate. Grab it! */ > + CSCHED_VCPU_STAT_CRANK(speer, migrate_q); > + CSCHED_STAT_CRANK(migrate_queued); > + WARN_ON(vc->is_urgent); > + __runq_remove(speer); > + vc->processor = cpu; > + return speer; > + } > } > } > } > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
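The two-step logic George is discussing can be modeled in miniature: pass 0 only considers vcpus whose node-affinity admits the stealing cpu, pass 1 falls back to plain vcpu-affinity. This is a toy sketch under simplified assumptions (8-bit masks standing in for cpumasks, a flat array standing in for the runqueue; none of these names are the Xen types):

```c
#include <assert.h>
#include <stdint.h>

struct toy_vcpu {
    uint8_t cpu_affinity;   /* bit i set => vcpu may run on cpu i      */
    uint8_t node_affinity;  /* bit i set => vcpu prefers to run there  */
};

/* Return the index of the first stealable vcpu in runq q for 'cpu',
 * or -1 if none.  Step 0 restricts the search to node-affine vcpus;
 * step 1 retries with vcpu-affinity only, mirroring the
 * for_each_csched_balance_step() loop in the patch. */
static int pick_vcpu(const struct toy_vcpu *q, int n, int cpu)
{
    for (int step = 0; step < 2; step++) {
        for (int i = 0; i < n; i++) {
            uint8_t mask = q[i].cpu_affinity;
            if (step == 0)
                mask &= q[i].node_affinity;  /* node-affinity pass */
            if (mask & (1u << cpu))
                return i;                    /* first hit wins */
        }
    }
    return -1;
}
```

This also makes George's observation concrete: pass 0 returns the first vcpu with *any* node-affinity match on this cpu, so a vcpu with a narrow affinity can be grabbed ahead of a possibly better candidate later in the queue, and each runqueue is walked twice in the worst case.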
George Dunlap
2012-Oct-09 16:47 UTC
Re: [PATCH 4 of 8] xen: allow for explicitly specifying node-affinity
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli <dario.faggioli@citrix.com> wrote:> Make it possible to pass the node-affinity of a domain to the hypervisor > from the upper layers, instead of always being computed automatically. > > Note that this also required generalizing the Flask hooks for setting > and getting the affinity, so that they now deal with both vcpu and > node affinity. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > diff --git a/xen/common/domain.c b/xen/common/domain.c > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -222,6 +222,7 @@ struct domain *domain_create( > > spin_lock_init(&d->node_affinity_lock); > d->node_affinity = NODE_MASK_ALL; > + d->auto_node_affinity = 1; > > spin_lock_init(&d->shutdown_lock); > d->shutdown_code = -1; > @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct > cpumask_or(cpumask, cpumask, online_affinity); > } > > - for_each_online_node ( node ) > - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) > - node_set(node, nodemask); > + if ( d->auto_node_affinity ) > + { > + /* Node-affinity is automaically computed from all vcpu-affinities */ > + for_each_online_node ( node ) > + if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) > + node_set(node, nodemask); > > - d->node_affinity = nodemask; > + d->node_affinity = nodemask; > + } > + else > + { > + /* Node-affinity is provided by someone else, just filter out cpus > + * that are either offline or not in the affinity of any vcpus. 
*/ > + for_each_node_mask ( node, d->node_affinity ) > + if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) ) > + node_clear(node, d->node_affinity); > + } > + > + sched_set_node_affinity(d, &d->node_affinity); > + > spin_unlock(&d->node_affinity_lock); > > free_cpumask_var(online_affinity); > @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct > } > > > +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) > +{ > + /* Being affine with no nodes is just wrong */ > + if ( nodes_empty(*affinity) ) > + return -EINVAL; > + > + spin_lock(&d->node_affinity_lock); > + > + /* > + * Being/becoming explicitly affine to all nodes is not particularly > + * useful. Let''s take it as the `reset node affinity` command. > + */ > + if ( nodes_full(*affinity) ) > + { > + d->auto_node_affinity = 1; > + goto out; > + } > + > + d->auto_node_affinity = 0; > + d->node_affinity = *affinity; > + > +out: > + spin_unlock(&d->node_affinity_lock); > + > + domain_update_node_affinity(d); > + > + return 0; > +} > + > + > struct domain *get_domain_by_id(domid_t dom) > { > struct domain *d; > diff --git a/xen/common/domctl.c b/xen/common/domctl.c > --- a/xen/common/domctl.c > +++ b/xen/common/domctl.c > @@ -642,6 +642,40 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc > } > break; > > + case XEN_DOMCTL_setnodeaffinity: > + case XEN_DOMCTL_getnodeaffinity: > + { > + domid_t dom = op->domain; > + struct domain *d = rcu_lock_domain_by_id(dom); > + > + ret = -ESRCH; > + if ( d == NULL ) > + break; > + > + ret = xsm_nodeaffinity(op->cmd, d); > + if ( ret ) > + goto nodeaffinity_out; > + > + if ( op->cmd == XEN_DOMCTL_setnodeaffinity ) > + { > + nodemask_t new_affinity; > + > + ret = xenctl_bitmap_to_nodemask(&new_affinity, > + &op->u.nodeaffinity.nodemap); > + if ( !ret ) > + ret = domain_set_node_affinity(d, &new_affinity); > + } > + else > + { > + ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap, > + &d->node_affinity); > + } > + > + nodeaffinity_out: > + 
rcu_unlock_domain(d); > + } > + break; > + > case XEN_DOMCTL_setvcpuaffinity: > case XEN_DOMCTL_getvcpuaffinity: > { > diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c > --- a/xen/common/keyhandler.c > +++ b/xen/common/keyhandler.c > @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int > *set++ = ''\0''; > } > > +static void nodeset_print(char *set, int size, const nodemask_t *mask) > +{ > + *set++ = ''[''; > + set += nodelist_scnprintf(set, size-2, mask); > + *set++ = '']''; > + *set++ = ''\0''; > +} > + > static void periodic_timer_print(char *str, int size, uint64_t period) > { > if ( period == 0 ) > @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k > > dump_pageframe_info(d); > > + nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity); > + printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr); > + > printk("VCPU information and callbacks for domain %u:\n", > d->domain_id); > for_each_vcpu ( d, v ) > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -238,6 +238,33 @@ static inline void > list_del_init(&svc->runq_elem); > } > > +/* > + * Translates node-affinity mask into a cpumask, so that we can use it during > + * actual scheduling. That of course will contain all the cpus from all the > + * set nodes in the original node-affinity mask. > + * > + * Note that any serialization needed to access mask safely is complete > + * responsibility of the caller of this function/hook. 
> + */ > +static void csched_set_node_affinity( > + const struct scheduler *ops, > + struct domain *d, > + nodemask_t *mask) > +{ > + struct csched_dom *sdom; > + int node; > + > + /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ > + if ( unlikely(is_idle_domain(d)) ) > + return; > + > + sdom = CSCHED_DOM(d); > + cpumask_clear(sdom->node_affinity_cpumask); > + for_each_node_mask( node, *mask ) > + cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, > + &node_to_cpumask(node)); > +} > + > #define for_each_csched_balance_step(__step) \ > for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) > > @@ -260,7 +287,8 @@ csched_balance_cpumask(const struct vcpu > struct domain *d = vc->domain; > struct csched_dom *sdom = CSCHED_DOM(d); > > - if ( cpumask_full(sdom->node_affinity_cpumask) ) > + if ( cpumask_full(sdom->node_affinity_cpumask) || > + d->auto_node_affinity == 1 ) > return -1; > > cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); > @@ -1786,6 +1814,8 @@ const struct scheduler sched_credit_def > .adjust = csched_dom_cntl, > .adjust_global = csched_sys_cntl, > > + .set_node_affinity = csched_set_node_affinity, > + > .pick_cpu = csched_cpu_pick, > .do_schedule = csched_schedule, > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > --- a/xen/common/schedule.c > +++ b/xen/common/schedule.c > @@ -588,6 +588,11 @@ int cpu_disable_scheduler(unsigned int c > return ret; > } > > +void sched_set_node_affinity(struct domain *d, nodemask_t *mask) > +{ > + SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); > +} > + > int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) > { > cpumask_t online_affinity; > diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h > --- a/xen/include/public/domctl.h > +++ b/xen/include/public/domctl.h > @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe > DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t); > > > +/* Get/set the NUMA 
node(s) with which the guest has affinity with. */ > +/* XEN_DOMCTL_setnodeaffinity */ > +/* XEN_DOMCTL_getnodeaffinity */ > +struct xen_domctl_nodeaffinity { > + struct xenctl_bitmap nodemap;/* IN */ > +}; > +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; > +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); > + > + > /* Get/set which physical cpus a vcpu can execute on. */ > /* XEN_DOMCTL_setvcpuaffinity */ > /* XEN_DOMCTL_getvcpuaffinity */ > @@ -900,6 +910,8 @@ struct xen_domctl { > #define XEN_DOMCTL_set_access_required 64 > #define XEN_DOMCTL_audit_p2m 65 > #define XEN_DOMCTL_set_virq_handler 66 > +#define XEN_DOMCTL_setnodeaffinity 67 > +#define XEN_DOMCTL_getnodeaffinity 68 > #define XEN_DOMCTL_gdbsx_guestmemio 1000 > #define XEN_DOMCTL_gdbsx_pausevcpu 1001 > #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 > @@ -913,6 +925,7 @@ struct xen_domctl { > struct xen_domctl_getpageframeinfo getpageframeinfo; > struct xen_domctl_getpageframeinfo2 getpageframeinfo2; > struct xen_domctl_getpageframeinfo3 getpageframeinfo3; > + struct xen_domctl_nodeaffinity nodeaffinity; > struct xen_domctl_vcpuaffinity vcpuaffinity; > struct xen_domctl_shadow_op shadow_op; > struct xen_domctl_max_mem max_mem; > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h > --- a/xen/include/xen/nodemask.h > +++ b/xen/include/xen/nodemask.h > @@ -8,8 +8,9 @@ > * See detailed comments in the file linux/bitmap.h describing the > * data type on which these nodemasks are based. > * > - * For details of nodemask_scnprintf() and nodemask_parse(), > - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. > + * For details of nodemask_scnprintf(), nodelist_scnpintf() and > + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse() > + * in lib/bitmap.c. 
> * > * The available nodemask operations are: > * > @@ -48,6 +49,7 @@ > * unsigned long *nodes_addr(mask) Array of unsigned long''s in mask > * > * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing > + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing > * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask > * > * for_each_node_mask(node, mask) for-loop node over mask > @@ -280,6 +282,14 @@ static inline int __first_unset_node(con > > #define nodes_addr(src) ((src).bits) > > +#define nodelist_scnprintf(buf, len, src) \ > + __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES) > +static inline int __nodelist_scnprintf(char *buf, int len, > + const nodemask_t *srcp, int nbits) > +{ > + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); > +} > + > #if 0 > #define nodemask_scnprintf(buf, len, src) \ > __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES) > diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h > --- a/xen/include/xen/sched-if.h > +++ b/xen/include/xen/sched-if.h > @@ -182,6 +182,8 @@ struct scheduler { > struct xen_domctl_scheduler_op *); > int (*adjust_global) (const struct scheduler *, > struct xen_sysctl_scheduler_op *); > + void (*set_node_affinity) (const struct scheduler *, > + struct domain *, nodemask_t *); > void (*dump_settings) (const struct scheduler *); > void (*dump_cpu_state) (const struct scheduler *, int); > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -346,8 +346,12 @@ struct domain > /* Various mem_events */ > struct mem_event_per_domain *mem_event; > > - /* Currently computed from union of all vcpu cpu-affinity masks. */ > + /* > + * Can be specified by the user. If that is not the case, it is > + * computed from the union of all the vcpu cpu-affinity masks. 
> + */ > nodemask_t node_affinity; > + int auto_node_affinity; > unsigned int last_alloc_node; > spinlock_t node_affinity_lock; > }; > @@ -416,6 +420,7 @@ static inline void get_knownalive_domain > ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED)); > } > > +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity); > void domain_update_node_affinity(struct domain *d); > > struct domain *domain_create( > @@ -519,6 +524,7 @@ void sched_destroy_domain(struct domain > int sched_move_domain(struct domain *d, struct cpupool *c); > long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); > long sched_adjust_global(struct xen_sysctl_scheduler_op *); > +void sched_set_node_affinity(struct domain *, nodemask_t *); > int sched_id(void); > void sched_tick_suspend(void); > void sched_tick_resume(void); > diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h > --- a/xen/include/xsm/xsm.h > +++ b/xen/include/xsm/xsm.h > @@ -56,6 +56,7 @@ struct xsm_operations { > int (*domain_create) (struct domain *d, u32 ssidref); > int (*max_vcpus) (struct domain *d); > int (*destroydomain) (struct domain *d); > + int (*nodeaffinity) (int cmd, struct domain *d); > int (*vcpuaffinity) (int cmd, struct domain *d); > int (*scheduler) (struct domain *d); > int (*getdomaininfo) (struct domain *d); > @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str > return xsm_call(destroydomain(d)); > } > > +static inline int xsm_nodeaffinity (int cmd, struct domain *d) > +{ > + return xsm_call(nodeaffinity(cmd, d)); > +} > + > static inline int xsm_vcpuaffinity (int cmd, struct domain *d) > { > return xsm_call(vcpuaffinity(cmd, d)); > diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c > --- a/xen/xsm/dummy.c > +++ b/xen/xsm/dummy.c > @@ -634,6 +634,7 @@ void xsm_fixup_ops (struct xsm_operation > set_to_dummy_if_null(ops, domain_create); > set_to_dummy_if_null(ops, max_vcpus); > set_to_dummy_if_null(ops, destroydomain); > + set_to_dummy_if_null(ops, nodeaffinity); > 
set_to_dummy_if_null(ops, vcpuaffinity); > set_to_dummy_if_null(ops, scheduler); > set_to_dummy_if_null(ops, getdomaininfo); > diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c > --- a/xen/xsm/flask/hooks.c > +++ b/xen/xsm/flask/hooks.c > @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do > DOMAIN__DESTROY); > } > > -static int flask_vcpuaffinity(int cmd, struct domain *d) > +static int flask_affinity(int cmd, struct domain *d) > { > u32 perm; > > switch ( cmd ) > { > case XEN_DOMCTL_setvcpuaffinity: > - perm = DOMAIN__SETVCPUAFFINITY; > + case XEN_DOMCTL_setnodeaffinity: > + perm = DOMAIN__SETAFFINITY; > break; > case XEN_DOMCTL_getvcpuaffinity: > - perm = DOMAIN__GETVCPUAFFINITY; > + case XEN_DOMCTL_getnodeaffinity: > + perm = DOMAIN__GETAFFINITY; > break; > default: > return -EPERM; > @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops > .domain_create = flask_domain_create, > .max_vcpus = flask_max_vcpus, > .destroydomain = flask_destroydomain, > - .vcpuaffinity = flask_vcpuaffinity, > + .nodeaffinity = flask_affinity, > + .vcpuaffinity = flask_affinity, > .scheduler = flask_scheduler, > .getdomaininfo = flask_getdomaininfo, > .getvcpucontext = flask_getvcpucontext, > diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h > --- a/xen/xsm/flask/include/av_perm_to_string.h > +++ b/xen/xsm/flask/include/av_perm_to_string.h > @@ -37,8 +37,8 @@ > S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") > S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") > S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") > + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") > + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity")The top of this file says, "This file is automatically generated. Do not edit." 
I didn't see any files that might have been modified to effect these changes -- did I miss them? Or is the comment a lie? Or should you find that file and edit it instead? :-)
> S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") > S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") > S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") > diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h > --- a/xen/xsm/flask/include/av_permissions.h > +++ b/xen/xsm/flask/include/av_permissions.h > @@ -38,8 +38,8 @@ > #define DOMAIN__TRANSITION 0x00000020UL > #define DOMAIN__MAX_VCPUS 0x00000040UL > #define DOMAIN__DESTROY 0x00000080UL > -#define DOMAIN__SETVCPUAFFINITY 0x00000100UL > -#define DOMAIN__GETVCPUAFFINITY 0x00000200UL > +#define DOMAIN__SETAFFINITY 0x00000100UL > +#define DOMAIN__GETAFFINITY 0x00000200UL
Same thing here. Other than that, looks good! -George
> #define DOMAIN__SCHEDULER 0x00000400UL > #define DOMAIN__GETDOMAININFO 0x00000800UL > #define DOMAIN__GETVCPUINFO 0x00001000UL > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel
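The `domain_set_node_affinity()` semantics reviewed above have three cases: an empty mask is rejected, a full mask is taken as "reset to automatic", and anything else installs an explicit affinity. A toy model of just that decision logic (8-bit mask and struct names are illustrative, not the Xen types; locking and the subsequent `domain_update_node_affinity()` call are omitted):

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NODES_FULL 0xFFu   /* all 8 toy nodes set */
#define TOY_EINVAL     22

struct toy_domain {
    uint8_t node_affinity;
    int     auto_node_affinity;
};

static int toy_set_node_affinity(struct toy_domain *d, uint8_t affinity)
{
    if (affinity == 0)
        return -TOY_EINVAL;             /* affine to no node: wrong   */

    if (affinity == TOY_NODES_FULL) {   /* full mask: back to "auto"; */
        d->auto_node_affinity = 1;      /* the mask itself will be    */
        return 0;                       /* recomputed from vcpus      */
    }

    d->auto_node_affinity = 0;          /* explicit affinity          */
    d->node_affinity = affinity;
    return 0;
}
```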
Ian Campbell
2012-Oct-09 16:52 UTC
Re: [PATCH 4 of 8] xen: allow for explicitly specifying node-affinity
Could you trim your quotes please?
> > diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h > > --- a/xen/xsm/flask/include/av_perm_to_string.h > > +++ b/xen/xsm/flask/include/av_perm_to_string.h > > @@ -37,8 +37,8 @@ > > S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") > > S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") > > S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") > > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") > > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") > > + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") > > + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity") > > The top of this file says, "This file is automatically generated. Do > not edit." I didn't see any files that might have been modified to > effect these changes -- did I miss them? Or is the comment a lie? Or > should you find that file and edit it instead? :-)
Also, in that case why is this file checked in? Usually the reason is if the generating tool is not widely available, but in this case it seems to be tools/flask/policy/policy/flask/mkflask.sh which depends on awk and not a lot else -- so I think we can rely on it being available. Ian.
Dario Faggioli
2012-Oct-09 17:17 UTC
Re: [PATCH 4 of 8] xen: allow for explicitly specifying node-affinity
On Tue, 2012-10-09 at 17:47 +0100, George Dunlap wrote:
> > diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h > > --- a/xen/xsm/flask/include/av_perm_to_string.h > > +++ b/xen/xsm/flask/include/av_perm_to_string.h > > @@ -37,8 +37,8 @@ > > S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") > > S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") > > S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") > > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") > > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") > > + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") > > + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity") > > The top of this file says, "This file is automatically generated. Do > not edit." I didn't see any files that might have been modified to > effect these changes -- did I miss them? Or is the comment a lie? Or > should you find that file and edit it instead? :-) >
Wow! I said I have very poor knowledge of this security hook thing, but that is something quite big that I appear to have missed!! :-P Thanks for pointing that out, and sorry for that. I'll definitely have to take a look at the generator, and will do while resending. Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Daniel De Graaf
2012-Oct-09 18:31 UTC
[PATCH RFC] flask: move policy header sources into hypervisor
Ian Campbell wrote: [...]>>> +++ b/xen/xsm/flask/include/av_perm_to_string.h > Also, in that case why is this file checked in?This patch fixes the autogenerated files, but doesn''t fully wire them in to things like "make clean" or .{git,hg}ignore. I don''t see an obvious way to clean generated header files in Xen''s build system; perhaps someone who knows the build system better can point out the right way to wire this up. --------------------------------------->8---------------------------- Rather than keeping around headers that are autogenerated in order to avoid adding build dependencies from xen/ to files in tools/, move the relevant parts of the FLASK policy into the hypervisor tree and generate the headers as part of the hypervisor''s build. Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov> --- tools/flask/policy/Makefile | 2 +- tools/flask/policy/policy/flask/Makefile | 41 ------ xen/xsm/flask/Makefile | 21 +++ xen/xsm/flask/include/av_perm_to_string.h | 147 ------------------- xen/xsm/flask/include/av_permissions.h | 157 --------------------- xen/xsm/flask/include/class_to_string.h | 15 -- xen/xsm/flask/include/flask.h | 35 ----- xen/xsm/flask/include/initial_sid_to_string.h | 16 --- .../flask => xen/xsm/flask/policy}/access_vectors | 0 .../flask => xen/xsm/flask/policy}/initial_sids | 0 .../xsm/flask/policy}/mkaccess_vector.sh | 4 +- .../flask => xen/xsm/flask/policy}/mkflask.sh | 6 +- .../xsm/flask/policy}/security_classes | 0 13 files changed, 27 insertions(+), 417 deletions(-) delete mode 100644 tools/flask/policy/policy/flask/Makefile delete mode 100644 xen/xsm/flask/include/av_perm_to_string.h delete mode 100644 xen/xsm/flask/include/av_permissions.h delete mode 100644 xen/xsm/flask/include/class_to_string.h delete mode 100644 xen/xsm/flask/include/flask.h delete mode 100644 xen/xsm/flask/include/initial_sid_to_string.h rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/access_vectors (100%) rename {tools/flask/policy/policy/flask 
=> xen/xsm/flask/policy}/initial_sids (100%) rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/mkaccess_vector.sh (97%) rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/mkflask.sh (95%) rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/security_classes (100%) diff --git a/tools/flask/policy/Makefile b/tools/flask/policy/Makefile index 5c25cbe..3f5aa38 100644 --- a/tools/flask/policy/Makefile +++ b/tools/flask/policy/Makefile @@ -61,7 +61,7 @@ LOADPOLICY := $(SBINDIR)/flask-loadpolicy # policy source layout POLDIR := policy MODDIR := $(POLDIR)/modules -FLASKDIR := $(POLDIR)/flask +FLASKDIR := ../../../xen/xsm/flask/policy SECCLASS := $(FLASKDIR)/security_classes ISIDS := $(FLASKDIR)/initial_sids AVS := $(FLASKDIR)/access_vectors diff --git a/tools/flask/policy/policy/flask/Makefile b/tools/flask/policy/policy/flask/Makefile deleted file mode 100644 index 5f57e88..0000000 --- a/tools/flask/policy/policy/flask/Makefile +++ /dev/null @@ -1,41 +0,0 @@ -# flask needs to know where to export the libselinux headers. -LIBSEL ?= ../../libselinux - -# flask needs to know where to export the kernel headers. 
-LINUXDIR ?= ../../../linux-2.6 - -AWK = awk - -CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \ - else if [ -x /bin/bash ]; then echo /bin/bash; \ - else echo sh; fi ; fi) - -FLASK_H_DEPEND = security_classes initial_sids -AV_H_DEPEND = access_vectors - -FLASK_H_FILES = class_to_string.h flask.h initial_sid_to_string.h -AV_H_FILES = av_perm_to_string.h av_permissions.h -ALL_H_FILES = $(FLASK_H_FILES) $(AV_H_FILES) - -all: $(ALL_H_FILES) - -$(FLASK_H_FILES): $(FLASK_H_DEPEND) - $(CONFIG_SHELL) mkflask.sh $(AWK) $(FLASK_H_DEPEND) - -$(AV_H_FILES): $(AV_H_DEPEND) - $(CONFIG_SHELL) mkaccess_vector.sh $(AWK) $(AV_H_DEPEND) - -tolib: all - install -m 644 flask.h av_permissions.h $(LIBSEL)/include/selinux - install -m 644 class_to_string.h av_inherit.h common_perm_to_string.h av_perm_to_string.h $(LIBSEL)/src - -tokern: all - install -m 644 $(ALL_H_FILES) $(LINUXDIR)/security/selinux/include - -install: all - -relabel: - -clean: - rm -f $(FLASK_H_FILES) - rm -f $(AV_H_FILES) diff --git a/xen/xsm/flask/Makefile b/xen/xsm/flask/Makefile index 92fb410..238495a 100644 --- a/xen/xsm/flask/Makefile +++ b/xen/xsm/flask/Makefile @@ -5,3 +5,24 @@ obj-y += flask_op.o subdir-y += ss CFLAGS += -I./include + +AWK = awk + +CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \ + else if [ -x /bin/bash ]; then echo /bin/bash; \ + else echo sh; fi ; fi) + +FLASK_H_DEPEND = policy/security_classes policy/initial_sids +AV_H_DEPEND = policy/access_vectors + +FLASK_H_FILES = include/flask.h include/class_to_string.h include/initial_sid_to_string.h +AV_H_FILES = include/av_perm_to_string.h include/av_permissions.h +ALL_H_FILES = $(FLASK_H_FILES) $(AV_H_FILES) + +$(obj-y) ss/built_in.o: $(ALL_H_FILES) + +$(FLASK_H_FILES): $(FLASK_H_DEPEND) + $(CONFIG_SHELL) policy/mkflask.sh $(AWK) $(FLASK_H_DEPEND) + +$(AV_H_FILES): $(AV_H_DEPEND) + $(CONFIG_SHELL) policy/mkaccess_vector.sh $(AWK) $(AV_H_DEPEND) diff --git a/xen/xsm/flask/include/av_perm_to_string.h 
b/xen/xsm/flask/include/av_perm_to_string.h deleted file mode 100644 index c3f2370..0000000 --- a/xen/xsm/flask/include/av_perm_to_string.h +++ /dev/null @@ -1,147 +0,0 @@ -/* This file is automatically generated. Do not edit. */ - S_(SECCLASS_XEN, XEN__SCHEDULER, "scheduler") - S_(SECCLASS_XEN, XEN__SETTIME, "settime") - S_(SECCLASS_XEN, XEN__TBUFCONTROL, "tbufcontrol") - S_(SECCLASS_XEN, XEN__READCONSOLE, "readconsole") - S_(SECCLASS_XEN, XEN__CLEARCONSOLE, "clearconsole") - S_(SECCLASS_XEN, XEN__PERFCONTROL, "perfcontrol") - S_(SECCLASS_XEN, XEN__MTRR_ADD, "mtrr_add") - S_(SECCLASS_XEN, XEN__MTRR_DEL, "mtrr_del") - S_(SECCLASS_XEN, XEN__MTRR_READ, "mtrr_read") - S_(SECCLASS_XEN, XEN__MICROCODE, "microcode") - S_(SECCLASS_XEN, XEN__PHYSINFO, "physinfo") - S_(SECCLASS_XEN, XEN__QUIRK, "quirk") - S_(SECCLASS_XEN, XEN__WRITECONSOLE, "writeconsole") - S_(SECCLASS_XEN, XEN__READAPIC, "readapic") - S_(SECCLASS_XEN, XEN__WRITEAPIC, "writeapic") - S_(SECCLASS_XEN, XEN__PRIVPROFILE, "privprofile") - S_(SECCLASS_XEN, XEN__NONPRIVPROFILE, "nonprivprofile") - S_(SECCLASS_XEN, XEN__KEXEC, "kexec") - S_(SECCLASS_XEN, XEN__FIRMWARE, "firmware") - S_(SECCLASS_XEN, XEN__SLEEP, "sleep") - S_(SECCLASS_XEN, XEN__FREQUENCY, "frequency") - S_(SECCLASS_XEN, XEN__GETIDLE, "getidle") - S_(SECCLASS_XEN, XEN__DEBUG, "debug") - S_(SECCLASS_XEN, XEN__GETCPUINFO, "getcpuinfo") - S_(SECCLASS_XEN, XEN__HEAP, "heap") - S_(SECCLASS_XEN, XEN__PM_OP, "pm_op") - S_(SECCLASS_XEN, XEN__MCA_OP, "mca_op") - S_(SECCLASS_XEN, XEN__LOCKPROF, "lockprof") - S_(SECCLASS_XEN, XEN__CPUPOOL_OP, "cpupool_op") - S_(SECCLASS_XEN, XEN__SCHED_OP, "sched_op") - S_(SECCLASS_XEN, XEN__TMEM_OP, "tmem_op") - S_(SECCLASS_XEN, XEN__TMEM_CONTROL, "tmem_control") - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUCONTEXT, "setvcpucontext") - S_(SECCLASS_DOMAIN, DOMAIN__PAUSE, "pause") - S_(SECCLASS_DOMAIN, DOMAIN__UNPAUSE, "unpause") - S_(SECCLASS_DOMAIN, DOMAIN__RESUME, "resume") - S_(SECCLASS_DOMAIN, DOMAIN__CREATE, "create") - 
S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") - S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") - S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") - S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") - S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUCONTEXT, "getvcpucontext") - S_(SECCLASS_DOMAIN, DOMAIN__SETDOMAINMAXMEM, "setdomainmaxmem") - S_(SECCLASS_DOMAIN, DOMAIN__SETDOMAINHANDLE, "setdomainhandle") - S_(SECCLASS_DOMAIN, DOMAIN__SETDEBUGGING, "setdebugging") - S_(SECCLASS_DOMAIN, DOMAIN__HYPERCALL, "hypercall") - S_(SECCLASS_DOMAIN, DOMAIN__SETTIME, "settime") - S_(SECCLASS_DOMAIN, DOMAIN__SET_TARGET, "set_target") - S_(SECCLASS_DOMAIN, DOMAIN__SHUTDOWN, "shutdown") - S_(SECCLASS_DOMAIN, DOMAIN__SETADDRSIZE, "setaddrsize") - S_(SECCLASS_DOMAIN, DOMAIN__GETADDRSIZE, "getaddrsize") - S_(SECCLASS_DOMAIN, DOMAIN__TRIGGER, "trigger") - S_(SECCLASS_DOMAIN, DOMAIN__GETEXTVCPUCONTEXT, "getextvcpucontext") - S_(SECCLASS_DOMAIN, DOMAIN__SETEXTVCPUCONTEXT, "setextvcpucontext") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUEXTSTATE, "getvcpuextstate") - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUEXTSTATE, "setvcpuextstate") - S_(SECCLASS_DOMAIN, DOMAIN__GETPODTARGET, "getpodtarget") - S_(SECCLASS_DOMAIN, DOMAIN__SETPODTARGET, "setpodtarget") - S_(SECCLASS_DOMAIN, DOMAIN__SET_MISC_INFO, "set_misc_info") - S_(SECCLASS_DOMAIN, DOMAIN__SET_VIRQ_HANDLER, "set_virq_handler") - S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELFROM, "relabelfrom") - S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELTO, "relabelto") - S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELSELF, "relabelself") - S_(SECCLASS_DOMAIN2, DOMAIN2__MAKE_PRIV_FOR, "make_priv_for") - S_(SECCLASS_DOMAIN2, DOMAIN2__SET_AS_TARGET, "set_as_target") - S_(SECCLASS_DOMAIN2, DOMAIN2__SET_CPUID, "set_cpuid") - 
S_(SECCLASS_DOMAIN2, DOMAIN2__GETTSC, "gettsc") - S_(SECCLASS_DOMAIN2, DOMAIN2__SETTSC, "settsc") - S_(SECCLASS_HVM, HVM__SETHVMC, "sethvmc") - S_(SECCLASS_HVM, HVM__GETHVMC, "gethvmc") - S_(SECCLASS_HVM, HVM__SETPARAM, "setparam") - S_(SECCLASS_HVM, HVM__GETPARAM, "getparam") - S_(SECCLASS_HVM, HVM__PCILEVEL, "pcilevel") - S_(SECCLASS_HVM, HVM__IRQLEVEL, "irqlevel") - S_(SECCLASS_HVM, HVM__PCIROUTE, "pciroute") - S_(SECCLASS_HVM, HVM__BIND_IRQ, "bind_irq") - S_(SECCLASS_HVM, HVM__CACHEATTR, "cacheattr") - S_(SECCLASS_HVM, HVM__TRACKDIRTYVRAM, "trackdirtyvram") - S_(SECCLASS_HVM, HVM__HVMCTL, "hvmctl") - S_(SECCLASS_HVM, HVM__MEM_EVENT, "mem_event") - S_(SECCLASS_HVM, HVM__MEM_SHARING, "mem_sharing") - S_(SECCLASS_HVM, HVM__AUDIT_P2M, "audit_p2m") - S_(SECCLASS_HVM, HVM__SEND_IRQ, "send_irq") - S_(SECCLASS_HVM, HVM__SHARE_MEM, "share_mem") - S_(SECCLASS_EVENT, EVENT__BIND, "bind") - S_(SECCLASS_EVENT, EVENT__SEND, "send") - S_(SECCLASS_EVENT, EVENT__STATUS, "status") - S_(SECCLASS_EVENT, EVENT__NOTIFY, "notify") - S_(SECCLASS_EVENT, EVENT__CREATE, "create") - S_(SECCLASS_EVENT, EVENT__RESET, "reset") - S_(SECCLASS_GRANT, GRANT__MAP_READ, "map_read") - S_(SECCLASS_GRANT, GRANT__MAP_WRITE, "map_write") - S_(SECCLASS_GRANT, GRANT__UNMAP, "unmap") - S_(SECCLASS_GRANT, GRANT__TRANSFER, "transfer") - S_(SECCLASS_GRANT, GRANT__SETUP, "setup") - S_(SECCLASS_GRANT, GRANT__COPY, "copy") - S_(SECCLASS_GRANT, GRANT__QUERY, "query") - S_(SECCLASS_MMU, MMU__MAP_READ, "map_read") - S_(SECCLASS_MMU, MMU__MAP_WRITE, "map_write") - S_(SECCLASS_MMU, MMU__PAGEINFO, "pageinfo") - S_(SECCLASS_MMU, MMU__PAGELIST, "pagelist") - S_(SECCLASS_MMU, MMU__ADJUST, "adjust") - S_(SECCLASS_MMU, MMU__STAT, "stat") - S_(SECCLASS_MMU, MMU__TRANSLATEGP, "translategp") - S_(SECCLASS_MMU, MMU__UPDATEMP, "updatemp") - S_(SECCLASS_MMU, MMU__PHYSMAP, "physmap") - S_(SECCLASS_MMU, MMU__PINPAGE, "pinpage") - S_(SECCLASS_MMU, MMU__MFNLIST, "mfnlist") - S_(SECCLASS_MMU, MMU__MEMORYMAP, "memorymap") - 
S_(SECCLASS_MMU, MMU__REMOTE_REMAP, "remote_remap") - S_(SECCLASS_MMU, MMU__MMUEXT_OP, "mmuext_op") - S_(SECCLASS_MMU, MMU__EXCHANGE, "exchange") - S_(SECCLASS_SHADOW, SHADOW__DISABLE, "disable") - S_(SECCLASS_SHADOW, SHADOW__ENABLE, "enable") - S_(SECCLASS_SHADOW, SHADOW__LOGDIRTY, "logdirty") - S_(SECCLASS_RESOURCE, RESOURCE__ADD, "add") - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE, "remove") - S_(SECCLASS_RESOURCE, RESOURCE__USE, "use") - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IRQ, "add_irq") - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IRQ, "remove_irq") - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IOPORT, "add_ioport") - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IOPORT, "remove_ioport") - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IOMEM, "add_iomem") - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IOMEM, "remove_iomem") - S_(SECCLASS_RESOURCE, RESOURCE__STAT_DEVICE, "stat_device") - S_(SECCLASS_RESOURCE, RESOURCE__ADD_DEVICE, "add_device") - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_DEVICE, "remove_device") - S_(SECCLASS_RESOURCE, RESOURCE__PLUG, "plug") - S_(SECCLASS_RESOURCE, RESOURCE__UNPLUG, "unplug") - S_(SECCLASS_RESOURCE, RESOURCE__SETUP, "setup") - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_AV, "compute_av") - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_CREATE, "compute_create") - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_MEMBER, "compute_member") - S_(SECCLASS_SECURITY, SECURITY__CHECK_CONTEXT, "check_context") - S_(SECCLASS_SECURITY, SECURITY__LOAD_POLICY, "load_policy") - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_RELABEL, "compute_relabel") - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_USER, "compute_user") - S_(SECCLASS_SECURITY, SECURITY__SETENFORCE, "setenforce") - S_(SECCLASS_SECURITY, SECURITY__SETBOOL, "setbool") - S_(SECCLASS_SECURITY, SECURITY__SETSECPARAM, "setsecparam") - S_(SECCLASS_SECURITY, SECURITY__ADD_OCONTEXT, "add_ocontext") - S_(SECCLASS_SECURITY, SECURITY__DEL_OCONTEXT, "del_ocontext") diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h 
deleted file mode 100644 index 65302e8..0000000 --- a/xen/xsm/flask/include/av_permissions.h +++ /dev/null @@ -1,157 +0,0 @@ -/* This file is automatically generated. Do not edit. */ -#define XEN__SCHEDULER 0x00000001UL -#define XEN__SETTIME 0x00000002UL -#define XEN__TBUFCONTROL 0x00000004UL -#define XEN__READCONSOLE 0x00000008UL -#define XEN__CLEARCONSOLE 0x00000010UL -#define XEN__PERFCONTROL 0x00000020UL -#define XEN__MTRR_ADD 0x00000040UL -#define XEN__MTRR_DEL 0x00000080UL -#define XEN__MTRR_READ 0x00000100UL -#define XEN__MICROCODE 0x00000200UL -#define XEN__PHYSINFO 0x00000400UL -#define XEN__QUIRK 0x00000800UL -#define XEN__WRITECONSOLE 0x00001000UL -#define XEN__READAPIC 0x00002000UL -#define XEN__WRITEAPIC 0x00004000UL -#define XEN__PRIVPROFILE 0x00008000UL -#define XEN__NONPRIVPROFILE 0x00010000UL -#define XEN__KEXEC 0x00020000UL -#define XEN__FIRMWARE 0x00040000UL -#define XEN__SLEEP 0x00080000UL -#define XEN__FREQUENCY 0x00100000UL -#define XEN__GETIDLE 0x00200000UL -#define XEN__DEBUG 0x00400000UL -#define XEN__GETCPUINFO 0x00800000UL -#define XEN__HEAP 0x01000000UL -#define XEN__PM_OP 0x02000000UL -#define XEN__MCA_OP 0x04000000UL -#define XEN__LOCKPROF 0x08000000UL -#define XEN__CPUPOOL_OP 0x10000000UL -#define XEN__SCHED_OP 0x20000000UL -#define XEN__TMEM_OP 0x40000000UL -#define XEN__TMEM_CONTROL 0x80000000UL - -#define DOMAIN__SETVCPUCONTEXT 0x00000001UL -#define DOMAIN__PAUSE 0x00000002UL -#define DOMAIN__UNPAUSE 0x00000004UL -#define DOMAIN__RESUME 0x00000008UL -#define DOMAIN__CREATE 0x00000010UL -#define DOMAIN__TRANSITION 0x00000020UL -#define DOMAIN__MAX_VCPUS 0x00000040UL -#define DOMAIN__DESTROY 0x00000080UL -#define DOMAIN__SETVCPUAFFINITY 0x00000100UL -#define DOMAIN__GETVCPUAFFINITY 0x00000200UL -#define DOMAIN__SCHEDULER 0x00000400UL -#define DOMAIN__GETDOMAININFO 0x00000800UL -#define DOMAIN__GETVCPUINFO 0x00001000UL -#define DOMAIN__GETVCPUCONTEXT 0x00002000UL -#define DOMAIN__SETDOMAINMAXMEM 0x00004000UL -#define 
DOMAIN__SETDOMAINHANDLE 0x00008000UL -#define DOMAIN__SETDEBUGGING 0x00010000UL -#define DOMAIN__HYPERCALL 0x00020000UL -#define DOMAIN__SETTIME 0x00040000UL -#define DOMAIN__SET_TARGET 0x00080000UL -#define DOMAIN__SHUTDOWN 0x00100000UL -#define DOMAIN__SETADDRSIZE 0x00200000UL -#define DOMAIN__GETADDRSIZE 0x00400000UL -#define DOMAIN__TRIGGER 0x00800000UL -#define DOMAIN__GETEXTVCPUCONTEXT 0x01000000UL -#define DOMAIN__SETEXTVCPUCONTEXT 0x02000000UL -#define DOMAIN__GETVCPUEXTSTATE 0x04000000UL -#define DOMAIN__SETVCPUEXTSTATE 0x08000000UL -#define DOMAIN__GETPODTARGET 0x10000000UL -#define DOMAIN__SETPODTARGET 0x20000000UL -#define DOMAIN__SET_MISC_INFO 0x40000000UL -#define DOMAIN__SET_VIRQ_HANDLER 0x80000000UL - -#define DOMAIN2__RELABELFROM 0x00000001UL -#define DOMAIN2__RELABELTO 0x00000002UL -#define DOMAIN2__RELABELSELF 0x00000004UL -#define DOMAIN2__MAKE_PRIV_FOR 0x00000008UL -#define DOMAIN2__SET_AS_TARGET 0x00000010UL -#define DOMAIN2__SET_CPUID 0x00000020UL -#define DOMAIN2__GETTSC 0x00000040UL -#define DOMAIN2__SETTSC 0x00000080UL - -#define HVM__SETHVMC 0x00000001UL -#define HVM__GETHVMC 0x00000002UL -#define HVM__SETPARAM 0x00000004UL -#define HVM__GETPARAM 0x00000008UL -#define HVM__PCILEVEL 0x00000010UL -#define HVM__IRQLEVEL 0x00000020UL -#define HVM__PCIROUTE 0x00000040UL -#define HVM__BIND_IRQ 0x00000080UL -#define HVM__CACHEATTR 0x00000100UL -#define HVM__TRACKDIRTYVRAM 0x00000200UL -#define HVM__HVMCTL 0x00000400UL -#define HVM__MEM_EVENT 0x00000800UL -#define HVM__MEM_SHARING 0x00001000UL -#define HVM__AUDIT_P2M 0x00002000UL -#define HVM__SEND_IRQ 0x00004000UL -#define HVM__SHARE_MEM 0x00008000UL - -#define EVENT__BIND 0x00000001UL -#define EVENT__SEND 0x00000002UL -#define EVENT__STATUS 0x00000004UL -#define EVENT__NOTIFY 0x00000008UL -#define EVENT__CREATE 0x00000010UL -#define EVENT__RESET 0x00000020UL - -#define GRANT__MAP_READ 0x00000001UL -#define GRANT__MAP_WRITE 0x00000002UL -#define GRANT__UNMAP 0x00000004UL -#define GRANT__TRANSFER 
0x00000008UL -#define GRANT__SETUP 0x00000010UL -#define GRANT__COPY 0x00000020UL -#define GRANT__QUERY 0x00000040UL - -#define MMU__MAP_READ 0x00000001UL -#define MMU__MAP_WRITE 0x00000002UL -#define MMU__PAGEINFO 0x00000004UL -#define MMU__PAGELIST 0x00000008UL -#define MMU__ADJUST 0x00000010UL -#define MMU__STAT 0x00000020UL -#define MMU__TRANSLATEGP 0x00000040UL -#define MMU__UPDATEMP 0x00000080UL -#define MMU__PHYSMAP 0x00000100UL -#define MMU__PINPAGE 0x00000200UL -#define MMU__MFNLIST 0x00000400UL -#define MMU__MEMORYMAP 0x00000800UL -#define MMU__REMOTE_REMAP 0x00001000UL -#define MMU__MMUEXT_OP 0x00002000UL -#define MMU__EXCHANGE 0x00004000UL - -#define SHADOW__DISABLE 0x00000001UL -#define SHADOW__ENABLE 0x00000002UL -#define SHADOW__LOGDIRTY 0x00000004UL - -#define RESOURCE__ADD 0x00000001UL -#define RESOURCE__REMOVE 0x00000002UL -#define RESOURCE__USE 0x00000004UL -#define RESOURCE__ADD_IRQ 0x00000008UL -#define RESOURCE__REMOVE_IRQ 0x00000010UL -#define RESOURCE__ADD_IOPORT 0x00000020UL -#define RESOURCE__REMOVE_IOPORT 0x00000040UL -#define RESOURCE__ADD_IOMEM 0x00000080UL -#define RESOURCE__REMOVE_IOMEM 0x00000100UL -#define RESOURCE__STAT_DEVICE 0x00000200UL -#define RESOURCE__ADD_DEVICE 0x00000400UL -#define RESOURCE__REMOVE_DEVICE 0x00000800UL -#define RESOURCE__PLUG 0x00001000UL -#define RESOURCE__UNPLUG 0x00002000UL -#define RESOURCE__SETUP 0x00004000UL - -#define SECURITY__COMPUTE_AV 0x00000001UL -#define SECURITY__COMPUTE_CREATE 0x00000002UL -#define SECURITY__COMPUTE_MEMBER 0x00000004UL -#define SECURITY__CHECK_CONTEXT 0x00000008UL -#define SECURITY__LOAD_POLICY 0x00000010UL -#define SECURITY__COMPUTE_RELABEL 0x00000020UL -#define SECURITY__COMPUTE_USER 0x00000040UL -#define SECURITY__SETENFORCE 0x00000080UL -#define SECURITY__SETBOOL 0x00000100UL -#define SECURITY__SETSECPARAM 0x00000200UL -#define SECURITY__ADD_OCONTEXT 0x00000400UL -#define SECURITY__DEL_OCONTEXT 0x00000800UL - diff --git a/xen/xsm/flask/include/class_to_string.h 
b/xen/xsm/flask/include/class_to_string.h deleted file mode 100644 index 7716645..0000000 --- a/xen/xsm/flask/include/class_to_string.h +++ /dev/null @@ -1,15 +0,0 @@ -/* This file is automatically generated. Do not edit. */ -/* - * Security object class definitions - */ - S_("null") - S_("xen") - S_("domain") - S_("domain2") - S_("hvm") - S_("mmu") - S_("resource") - S_("shadow") - S_("event") - S_("grant") - S_("security") diff --git a/xen/xsm/flask/include/flask.h b/xen/xsm/flask/include/flask.h deleted file mode 100644 index 3bff998..0000000 --- a/xen/xsm/flask/include/flask.h +++ /dev/null @@ -1,35 +0,0 @@ -/* This file is automatically generated. Do not edit. */ -#ifndef _SELINUX_FLASK_H_ -#define _SELINUX_FLASK_H_ - -/* - * Security object class definitions - */ -#define SECCLASS_XEN 1 -#define SECCLASS_DOMAIN 2 -#define SECCLASS_DOMAIN2 3 -#define SECCLASS_HVM 4 -#define SECCLASS_MMU 5 -#define SECCLASS_RESOURCE 6 -#define SECCLASS_SHADOW 7 -#define SECCLASS_EVENT 8 -#define SECCLASS_GRANT 9 -#define SECCLASS_SECURITY 10 - -/* - * Security identifier indices for initial entities - */ -#define SECINITSID_XEN 1 -#define SECINITSID_DOM0 2 -#define SECINITSID_DOMIO 3 -#define SECINITSID_DOMXEN 4 -#define SECINITSID_UNLABELED 5 -#define SECINITSID_SECURITY 6 -#define SECINITSID_IOPORT 7 -#define SECINITSID_IOMEM 8 -#define SECINITSID_IRQ 9 -#define SECINITSID_DEVICE 10 - -#define SECINITSID_NUM 10 - -#endif diff --git a/xen/xsm/flask/include/initial_sid_to_string.h b/xen/xsm/flask/include/initial_sid_to_string.h deleted file mode 100644 index 814f4bf..0000000 --- a/xen/xsm/flask/include/initial_sid_to_string.h +++ /dev/null @@ -1,16 +0,0 @@ -/* This file is automatically generated. Do not edit. 
*/ -static char *initial_sid_to_string[] -{ - "null", - "xen", - "dom0", - "domio", - "domxen", - "unlabeled", - "security", - "ioport", - "iomem", - "irq", - "device", -}; - diff --git a/tools/flask/policy/policy/flask/access_vectors b/xen/xsm/flask/policy/access_vectors similarity index 100% rename from tools/flask/policy/policy/flask/access_vectors rename to xen/xsm/flask/policy/access_vectors diff --git a/tools/flask/policy/policy/flask/initial_sids b/xen/xsm/flask/policy/initial_sids similarity index 100% rename from tools/flask/policy/policy/flask/initial_sids rename to xen/xsm/flask/policy/initial_sids diff --git a/tools/flask/policy/policy/flask/mkaccess_vector.sh b/xen/xsm/flask/policy/mkaccess_vector.sh similarity index 97% rename from tools/flask/policy/policy/flask/mkaccess_vector.sh rename to xen/xsm/flask/policy/mkaccess_vector.sh index 43a60a7..8ec87f7 100644 --- a/tools/flask/policy/policy/flask/mkaccess_vector.sh +++ b/xen/xsm/flask/policy/mkaccess_vector.sh @@ -9,8 +9,8 @@ awk=$1 shift # output files -av_permissions="av_permissions.h" -av_perm_to_string="av_perm_to_string.h" +av_permissions="include/av_permissions.h" +av_perm_to_string="include/av_perm_to_string.h" cat $* | $awk " BEGIN { diff --git a/tools/flask/policy/policy/flask/mkflask.sh b/xen/xsm/flask/policy/mkflask.sh similarity index 95% rename from tools/flask/policy/policy/flask/mkflask.sh rename to xen/xsm/flask/policy/mkflask.sh index 9c84754..e8d8fb5 100644 --- a/tools/flask/policy/policy/flask/mkflask.sh +++ b/xen/xsm/flask/policy/mkflask.sh @@ -9,9 +9,9 @@ awk=$1 shift 1 # output file -output_file="flask.h" -debug_file="class_to_string.h" -debug_file2="initial_sid_to_string.h" +output_file="include/flask.h" +debug_file="include/class_to_string.h" +debug_file2="include/initial_sid_to_string.h" cat $* | $awk " BEGIN { diff --git a/tools/flask/policy/policy/flask/security_classes b/xen/xsm/flask/policy/security_classes similarity index 100% rename from 
tools/flask/policy/policy/flask/security_classes rename to xen/xsm/flask/policy/security_classes -- 1.7.11.4
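The mkaccess_vector.sh script moved by this patch generates include/av_permissions.h from the policy's access_vectors list, assigning each permission within a class a doubling bit value (0x00000001UL, 0x00000002UL, ...), as can be seen in the deleted header above. A toy sketch of that bit assignment in plain shell (not the actual awk script; emit_av_defines is a hypothetical helper name used only for illustration):

```shell
#!/bin/sh
# Toy re-implementation of the bit assignment mkaccess_vector.sh performs:
# each permission in a class gets the next power-of-two mask.
emit_av_defines() {
    class=$1; shift
    bit=1
    for p in "$@"; do
        # Upper-case the permission name, e.g. scheduler -> SCHEDULER
        name=$(printf '%s' "$p" | tr '[:lower:]' '[:upper:]')
        printf '#define %s__%s 0x%08xUL\n' "$class" "$name" "$bit"
        bit=$((bit * 2))
    done
}

emit_av_defines XEN scheduler settime tbufcontrol
# -> #define XEN__SCHEDULER 0x00000001UL
#    #define XEN__SETTIME 0x00000002UL
#    #define XEN__TBUFCONTROL 0x00000004UL
```

The output matches the values in the (generated) av_permissions.h shown above, which is why regenerating the headers at build time, rather than checking them in, is safe.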
Matt Wilson
2012-Oct-09 20:20 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Tue, Oct 09, 2012 at 11:45:49AM +0100, Dario Faggioli wrote:
> Whether that is acceptable or not is of course debatable, and we had a
> bit of this discussion already (although no real conclusion has been
> reached yet).
> My take is that, right now, since we do not yet expose any virtual NUMA
> topology to the VM itself, the behaviour described above is fine. As
> soon as we'll have some guest NUMA awareness, then it might be
> worthwhile to try to preserve it, at least to some extent.

For what it's worth, under VMware all bets are off if a vNUMA-enabled guest is migrated via vMotion. See "Performance Best Practices for VMware vSphere 5.0" [1], page 40. There is also a good deal of information in a paper published by VMware Labs on HPC workloads [2] and a blog post on NUMA load balancing [3].

Matt

[1] http://www.vmware.com/pdf/Perf_Best_Practices_vSphere5.0.pdf
[2] http://labs.vmware.com/publications/performance-evaluation-of-hpc-benchmarks-on-vmwares-esxi-server
[3] http://blogs.vmware.com/vsphere/2012/02/vspherenuma-loadbalancing.html
Ian Campbell
2012-Oct-10 08:38 UTC
Re: [PATCH RFC] flask: move policy header sources into hypervisor
On Tue, 2012-10-09 at 19:31 +0100, Daniel De Graaf wrote:
> Ian Campbell wrote:
> [...]
> >>> +++ b/xen/xsm/flask/include/av_perm_to_string.h
> > Also, in that case why is this file checked in?
>
> This patch fixes the autogenerated files, but doesn't fully wire them in
> to things like "make clean" or .{git,hg}ignore. I don't see an obvious
> way to clean generated header files in Xen's build system; perhaps
> someone who knows the build system better can point out the right way to
> wire this up.

xen/arch/x86/Makefile has a clean:: rule which removes autogenerated stuff like the asm-offsets files. Probably the right model to follow.

Ian.

> --------------------------------------->8----------------------------
>
> Rather than keeping around headers that are autogenerated in order to
> avoid adding build dependencies from xen/ to files in tools/, move the
> relevant parts of the FLASK policy into the hypervisor tree and generate
> the headers as part of the hypervisor's build.
>
> Signed-off-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>
> ---
>  tools/flask/policy/Makefile                        |   2 +-
>  tools/flask/policy/policy/flask/Makefile           |  41 ------
>  xen/xsm/flask/Makefile                             |  21 +++
>  xen/xsm/flask/include/av_perm_to_string.h          | 147 -------------------
>  xen/xsm/flask/include/av_permissions.h             | 157 ---------------------
>  xen/xsm/flask/include/class_to_string.h            |  15 --
>  xen/xsm/flask/include/flask.h                      |  35 -----
>  xen/xsm/flask/include/initial_sid_to_string.h      |  16 ---
>  .../flask => xen/xsm/flask/policy}/access_vectors  |   0
>  .../flask => xen/xsm/flask/policy}/initial_sids    |   0
>  .../xsm/flask/policy}/mkaccess_vector.sh           |   4 +-
>  .../flask => xen/xsm/flask/policy}/mkflask.sh      |   6 +-
>  .../xsm/flask/policy}/security_classes             |   0
>  13 files changed, 27 insertions(+), 417 deletions(-)
>  delete mode 100644 tools/flask/policy/policy/flask/Makefile
>  delete mode 100644 xen/xsm/flask/include/av_perm_to_string.h
>  delete mode 100644 xen/xsm/flask/include/av_permissions.h
>  delete mode 100644
xen/xsm/flask/include/class_to_string.h > delete mode 100644 xen/xsm/flask/include/flask.h > delete mode 100644 xen/xsm/flask/include/initial_sid_to_string.h > rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/access_vectors (100%) > rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/initial_sids (100%) > rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/mkaccess_vector.sh (97%) > rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/mkflask.sh (95%) > rename {tools/flask/policy/policy/flask => xen/xsm/flask/policy}/security_classes (100%) > > diff --git a/tools/flask/policy/Makefile b/tools/flask/policy/Makefile > index 5c25cbe..3f5aa38 100644 > --- a/tools/flask/policy/Makefile > +++ b/tools/flask/policy/Makefile > @@ -61,7 +61,7 @@ LOADPOLICY := $(SBINDIR)/flask-loadpolicy > # policy source layout > POLDIR := policy > MODDIR := $(POLDIR)/modules > -FLASKDIR := $(POLDIR)/flask > +FLASKDIR := ../../../xen/xsm/flask/policy > SECCLASS := $(FLASKDIR)/security_classes > ISIDS := $(FLASKDIR)/initial_sids > AVS := $(FLASKDIR)/access_vectors > diff --git a/tools/flask/policy/policy/flask/Makefile b/tools/flask/policy/policy/flask/Makefile > deleted file mode 100644 > index 5f57e88..0000000 > --- a/tools/flask/policy/policy/flask/Makefile > +++ /dev/null > @@ -1,41 +0,0 @@ > -# flask needs to know where to export the libselinux headers. > -LIBSEL ?= ../../libselinux > - > -# flask needs to know where to export the kernel headers. 
> -LINUXDIR ?= ../../../linux-2.6 > - > -AWK = awk > - > -CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \ > - else if [ -x /bin/bash ]; then echo /bin/bash; \ > - else echo sh; fi ; fi) > - > -FLASK_H_DEPEND = security_classes initial_sids > -AV_H_DEPEND = access_vectors > - > -FLASK_H_FILES = class_to_string.h flask.h initial_sid_to_string.h > -AV_H_FILES = av_perm_to_string.h av_permissions.h > -ALL_H_FILES = $(FLASK_H_FILES) $(AV_H_FILES) > - > -all: $(ALL_H_FILES) > - > -$(FLASK_H_FILES): $(FLASK_H_DEPEND) > - $(CONFIG_SHELL) mkflask.sh $(AWK) $(FLASK_H_DEPEND) > - > -$(AV_H_FILES): $(AV_H_DEPEND) > - $(CONFIG_SHELL) mkaccess_vector.sh $(AWK) $(AV_H_DEPEND) > - > -tolib: all > - install -m 644 flask.h av_permissions.h $(LIBSEL)/include/selinux > - install -m 644 class_to_string.h av_inherit.h common_perm_to_string.h av_perm_to_string.h $(LIBSEL)/src > - > -tokern: all > - install -m 644 $(ALL_H_FILES) $(LINUXDIR)/security/selinux/include > - > -install: all > - > -relabel: > - > -clean: > - rm -f $(FLASK_H_FILES) > - rm -f $(AV_H_FILES) > diff --git a/xen/xsm/flask/Makefile b/xen/xsm/flask/Makefile > index 92fb410..238495a 100644 > --- a/xen/xsm/flask/Makefile > +++ b/xen/xsm/flask/Makefile > @@ -5,3 +5,24 @@ obj-y += flask_op.o > subdir-y += ss > > CFLAGS += -I./include > + > +AWK = awk > + > +CONFIG_SHELL := $(shell if [ -x "$$BASH" ]; then echo $$BASH; \ > + else if [ -x /bin/bash ]; then echo /bin/bash; \ > + else echo sh; fi ; fi) > + > +FLASK_H_DEPEND = policy/security_classes policy/initial_sids > +AV_H_DEPEND = policy/access_vectors > + > +FLASK_H_FILES = include/flask.h include/class_to_string.h include/initial_sid_to_string.h > +AV_H_FILES = include/av_perm_to_string.h include/av_permissions.h > +ALL_H_FILES = $(FLASK_H_FILES) $(AV_H_FILES) > + > +$(obj-y) ss/built_in.o: $(ALL_H_FILES) > + > +$(FLASK_H_FILES): $(FLASK_H_DEPEND) > + $(CONFIG_SHELL) policy/mkflask.sh $(AWK) $(FLASK_H_DEPEND) > + > +$(AV_H_FILES): $(AV_H_DEPEND) > + 
$(CONFIG_SHELL) policy/mkaccess_vector.sh $(AWK) $(AV_H_DEPEND) > diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h > deleted file mode 100644 > index c3f2370..0000000 > --- a/xen/xsm/flask/include/av_perm_to_string.h > +++ /dev/null > @@ -1,147 +0,0 @@ > -/* This file is automatically generated. Do not edit. */ > - S_(SECCLASS_XEN, XEN__SCHEDULER, "scheduler") > - S_(SECCLASS_XEN, XEN__SETTIME, "settime") > - S_(SECCLASS_XEN, XEN__TBUFCONTROL, "tbufcontrol") > - S_(SECCLASS_XEN, XEN__READCONSOLE, "readconsole") > - S_(SECCLASS_XEN, XEN__CLEARCONSOLE, "clearconsole") > - S_(SECCLASS_XEN, XEN__PERFCONTROL, "perfcontrol") > - S_(SECCLASS_XEN, XEN__MTRR_ADD, "mtrr_add") > - S_(SECCLASS_XEN, XEN__MTRR_DEL, "mtrr_del") > - S_(SECCLASS_XEN, XEN__MTRR_READ, "mtrr_read") > - S_(SECCLASS_XEN, XEN__MICROCODE, "microcode") > - S_(SECCLASS_XEN, XEN__PHYSINFO, "physinfo") > - S_(SECCLASS_XEN, XEN__QUIRK, "quirk") > - S_(SECCLASS_XEN, XEN__WRITECONSOLE, "writeconsole") > - S_(SECCLASS_XEN, XEN__READAPIC, "readapic") > - S_(SECCLASS_XEN, XEN__WRITEAPIC, "writeapic") > - S_(SECCLASS_XEN, XEN__PRIVPROFILE, "privprofile") > - S_(SECCLASS_XEN, XEN__NONPRIVPROFILE, "nonprivprofile") > - S_(SECCLASS_XEN, XEN__KEXEC, "kexec") > - S_(SECCLASS_XEN, XEN__FIRMWARE, "firmware") > - S_(SECCLASS_XEN, XEN__SLEEP, "sleep") > - S_(SECCLASS_XEN, XEN__FREQUENCY, "frequency") > - S_(SECCLASS_XEN, XEN__GETIDLE, "getidle") > - S_(SECCLASS_XEN, XEN__DEBUG, "debug") > - S_(SECCLASS_XEN, XEN__GETCPUINFO, "getcpuinfo") > - S_(SECCLASS_XEN, XEN__HEAP, "heap") > - S_(SECCLASS_XEN, XEN__PM_OP, "pm_op") > - S_(SECCLASS_XEN, XEN__MCA_OP, "mca_op") > - S_(SECCLASS_XEN, XEN__LOCKPROF, "lockprof") > - S_(SECCLASS_XEN, XEN__CPUPOOL_OP, "cpupool_op") > - S_(SECCLASS_XEN, XEN__SCHED_OP, "sched_op") > - S_(SECCLASS_XEN, XEN__TMEM_OP, "tmem_op") > - S_(SECCLASS_XEN, XEN__TMEM_CONTROL, "tmem_control") > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUCONTEXT, "setvcpucontext") > - 
S_(SECCLASS_DOMAIN, DOMAIN__PAUSE, "pause") > - S_(SECCLASS_DOMAIN, DOMAIN__UNPAUSE, "unpause") > - S_(SECCLASS_DOMAIN, DOMAIN__RESUME, "resume") > - S_(SECCLASS_DOMAIN, DOMAIN__CREATE, "create") > - S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") > - S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") > - S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") > - S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") > - S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUCONTEXT, "getvcpucontext") > - S_(SECCLASS_DOMAIN, DOMAIN__SETDOMAINMAXMEM, "setdomainmaxmem") > - S_(SECCLASS_DOMAIN, DOMAIN__SETDOMAINHANDLE, "setdomainhandle") > - S_(SECCLASS_DOMAIN, DOMAIN__SETDEBUGGING, "setdebugging") > - S_(SECCLASS_DOMAIN, DOMAIN__HYPERCALL, "hypercall") > - S_(SECCLASS_DOMAIN, DOMAIN__SETTIME, "settime") > - S_(SECCLASS_DOMAIN, DOMAIN__SET_TARGET, "set_target") > - S_(SECCLASS_DOMAIN, DOMAIN__SHUTDOWN, "shutdown") > - S_(SECCLASS_DOMAIN, DOMAIN__SETADDRSIZE, "setaddrsize") > - S_(SECCLASS_DOMAIN, DOMAIN__GETADDRSIZE, "getaddrsize") > - S_(SECCLASS_DOMAIN, DOMAIN__TRIGGER, "trigger") > - S_(SECCLASS_DOMAIN, DOMAIN__GETEXTVCPUCONTEXT, "getextvcpucontext") > - S_(SECCLASS_DOMAIN, DOMAIN__SETEXTVCPUCONTEXT, "setextvcpucontext") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUEXTSTATE, "getvcpuextstate") > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUEXTSTATE, "setvcpuextstate") > - S_(SECCLASS_DOMAIN, DOMAIN__GETPODTARGET, "getpodtarget") > - S_(SECCLASS_DOMAIN, DOMAIN__SETPODTARGET, "setpodtarget") > - S_(SECCLASS_DOMAIN, DOMAIN__SET_MISC_INFO, "set_misc_info") > - S_(SECCLASS_DOMAIN, DOMAIN__SET_VIRQ_HANDLER, "set_virq_handler") > - S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELFROM, "relabelfrom") > - S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELTO, "relabelto") > - 
S_(SECCLASS_DOMAIN2, DOMAIN2__RELABELSELF, "relabelself") > - S_(SECCLASS_DOMAIN2, DOMAIN2__MAKE_PRIV_FOR, "make_priv_for") > - S_(SECCLASS_DOMAIN2, DOMAIN2__SET_AS_TARGET, "set_as_target") > - S_(SECCLASS_DOMAIN2, DOMAIN2__SET_CPUID, "set_cpuid") > - S_(SECCLASS_DOMAIN2, DOMAIN2__GETTSC, "gettsc") > - S_(SECCLASS_DOMAIN2, DOMAIN2__SETTSC, "settsc") > - S_(SECCLASS_HVM, HVM__SETHVMC, "sethvmc") > - S_(SECCLASS_HVM, HVM__GETHVMC, "gethvmc") > - S_(SECCLASS_HVM, HVM__SETPARAM, "setparam") > - S_(SECCLASS_HVM, HVM__GETPARAM, "getparam") > - S_(SECCLASS_HVM, HVM__PCILEVEL, "pcilevel") > - S_(SECCLASS_HVM, HVM__IRQLEVEL, "irqlevel") > - S_(SECCLASS_HVM, HVM__PCIROUTE, "pciroute") > - S_(SECCLASS_HVM, HVM__BIND_IRQ, "bind_irq") > - S_(SECCLASS_HVM, HVM__CACHEATTR, "cacheattr") > - S_(SECCLASS_HVM, HVM__TRACKDIRTYVRAM, "trackdirtyvram") > - S_(SECCLASS_HVM, HVM__HVMCTL, "hvmctl") > - S_(SECCLASS_HVM, HVM__MEM_EVENT, "mem_event") > - S_(SECCLASS_HVM, HVM__MEM_SHARING, "mem_sharing") > - S_(SECCLASS_HVM, HVM__AUDIT_P2M, "audit_p2m") > - S_(SECCLASS_HVM, HVM__SEND_IRQ, "send_irq") > - S_(SECCLASS_HVM, HVM__SHARE_MEM, "share_mem") > - S_(SECCLASS_EVENT, EVENT__BIND, "bind") > - S_(SECCLASS_EVENT, EVENT__SEND, "send") > - S_(SECCLASS_EVENT, EVENT__STATUS, "status") > - S_(SECCLASS_EVENT, EVENT__NOTIFY, "notify") > - S_(SECCLASS_EVENT, EVENT__CREATE, "create") > - S_(SECCLASS_EVENT, EVENT__RESET, "reset") > - S_(SECCLASS_GRANT, GRANT__MAP_READ, "map_read") > - S_(SECCLASS_GRANT, GRANT__MAP_WRITE, "map_write") > - S_(SECCLASS_GRANT, GRANT__UNMAP, "unmap") > - S_(SECCLASS_GRANT, GRANT__TRANSFER, "transfer") > - S_(SECCLASS_GRANT, GRANT__SETUP, "setup") > - S_(SECCLASS_GRANT, GRANT__COPY, "copy") > - S_(SECCLASS_GRANT, GRANT__QUERY, "query") > - S_(SECCLASS_MMU, MMU__MAP_READ, "map_read") > - S_(SECCLASS_MMU, MMU__MAP_WRITE, "map_write") > - S_(SECCLASS_MMU, MMU__PAGEINFO, "pageinfo") > - S_(SECCLASS_MMU, MMU__PAGELIST, "pagelist") > - S_(SECCLASS_MMU, MMU__ADJUST, "adjust") > - 
S_(SECCLASS_MMU, MMU__STAT, "stat") > - S_(SECCLASS_MMU, MMU__TRANSLATEGP, "translategp") > - S_(SECCLASS_MMU, MMU__UPDATEMP, "updatemp") > - S_(SECCLASS_MMU, MMU__PHYSMAP, "physmap") > - S_(SECCLASS_MMU, MMU__PINPAGE, "pinpage") > - S_(SECCLASS_MMU, MMU__MFNLIST, "mfnlist") > - S_(SECCLASS_MMU, MMU__MEMORYMAP, "memorymap") > - S_(SECCLASS_MMU, MMU__REMOTE_REMAP, "remote_remap") > - S_(SECCLASS_MMU, MMU__MMUEXT_OP, "mmuext_op") > - S_(SECCLASS_MMU, MMU__EXCHANGE, "exchange") > - S_(SECCLASS_SHADOW, SHADOW__DISABLE, "disable") > - S_(SECCLASS_SHADOW, SHADOW__ENABLE, "enable") > - S_(SECCLASS_SHADOW, SHADOW__LOGDIRTY, "logdirty") > - S_(SECCLASS_RESOURCE, RESOURCE__ADD, "add") > - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE, "remove") > - S_(SECCLASS_RESOURCE, RESOURCE__USE, "use") > - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IRQ, "add_irq") > - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IRQ, "remove_irq") > - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IOPORT, "add_ioport") > - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IOPORT, "remove_ioport") > - S_(SECCLASS_RESOURCE, RESOURCE__ADD_IOMEM, "add_iomem") > - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_IOMEM, "remove_iomem") > - S_(SECCLASS_RESOURCE, RESOURCE__STAT_DEVICE, "stat_device") > - S_(SECCLASS_RESOURCE, RESOURCE__ADD_DEVICE, "add_device") > - S_(SECCLASS_RESOURCE, RESOURCE__REMOVE_DEVICE, "remove_device") > - S_(SECCLASS_RESOURCE, RESOURCE__PLUG, "plug") > - S_(SECCLASS_RESOURCE, RESOURCE__UNPLUG, "unplug") > - S_(SECCLASS_RESOURCE, RESOURCE__SETUP, "setup") > - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_AV, "compute_av") > - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_CREATE, "compute_create") > - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_MEMBER, "compute_member") > - S_(SECCLASS_SECURITY, SECURITY__CHECK_CONTEXT, "check_context") > - S_(SECCLASS_SECURITY, SECURITY__LOAD_POLICY, "load_policy") > - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_RELABEL, "compute_relabel") > - S_(SECCLASS_SECURITY, SECURITY__COMPUTE_USER, "compute_user") > - 
S_(SECCLASS_SECURITY, SECURITY__SETENFORCE, "setenforce") > - S_(SECCLASS_SECURITY, SECURITY__SETBOOL, "setbool") > - S_(SECCLASS_SECURITY, SECURITY__SETSECPARAM, "setsecparam") > - S_(SECCLASS_SECURITY, SECURITY__ADD_OCONTEXT, "add_ocontext") > - S_(SECCLASS_SECURITY, SECURITY__DEL_OCONTEXT, "del_ocontext") > diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h > deleted file mode 100644 > index 65302e8..0000000 > --- a/xen/xsm/flask/include/av_permissions.h > +++ /dev/null > @@ -1,157 +0,0 @@ > -/* This file is automatically generated. Do not edit. */ > -#define XEN__SCHEDULER 0x00000001UL > -#define XEN__SETTIME 0x00000002UL > -#define XEN__TBUFCONTROL 0x00000004UL > -#define XEN__READCONSOLE 0x00000008UL > -#define XEN__CLEARCONSOLE 0x00000010UL > -#define XEN__PERFCONTROL 0x00000020UL > -#define XEN__MTRR_ADD 0x00000040UL > -#define XEN__MTRR_DEL 0x00000080UL > -#define XEN__MTRR_READ 0x00000100UL > -#define XEN__MICROCODE 0x00000200UL > -#define XEN__PHYSINFO 0x00000400UL > -#define XEN__QUIRK 0x00000800UL > -#define XEN__WRITECONSOLE 0x00001000UL > -#define XEN__READAPIC 0x00002000UL > -#define XEN__WRITEAPIC 0x00004000UL > -#define XEN__PRIVPROFILE 0x00008000UL > -#define XEN__NONPRIVPROFILE 0x00010000UL > -#define XEN__KEXEC 0x00020000UL > -#define XEN__FIRMWARE 0x00040000UL > -#define XEN__SLEEP 0x00080000UL > -#define XEN__FREQUENCY 0x00100000UL > -#define XEN__GETIDLE 0x00200000UL > -#define XEN__DEBUG 0x00400000UL > -#define XEN__GETCPUINFO 0x00800000UL > -#define XEN__HEAP 0x01000000UL > -#define XEN__PM_OP 0x02000000UL > -#define XEN__MCA_OP 0x04000000UL > -#define XEN__LOCKPROF 0x08000000UL > -#define XEN__CPUPOOL_OP 0x10000000UL > -#define XEN__SCHED_OP 0x20000000UL > -#define XEN__TMEM_OP 0x40000000UL > -#define XEN__TMEM_CONTROL 0x80000000UL > - > -#define DOMAIN__SETVCPUCONTEXT 0x00000001UL > -#define DOMAIN__PAUSE 0x00000002UL > -#define DOMAIN__UNPAUSE 0x00000004UL > -#define DOMAIN__RESUME 0x00000008UL 
> -#define DOMAIN__CREATE              0x00000010UL
> -#define DOMAIN__TRANSITION          0x00000020UL
> -#define DOMAIN__MAX_VCPUS           0x00000040UL
> -#define DOMAIN__DESTROY             0x00000080UL
> -#define DOMAIN__SETVCPUAFFINITY     0x00000100UL
> -#define DOMAIN__GETVCPUAFFINITY     0x00000200UL
> -#define DOMAIN__SCHEDULER           0x00000400UL
> -#define DOMAIN__GETDOMAININFO       0x00000800UL
> -#define DOMAIN__GETVCPUINFO         0x00001000UL
> -#define DOMAIN__GETVCPUCONTEXT      0x00002000UL
> -#define DOMAIN__SETDOMAINMAXMEM     0x00004000UL
> -#define DOMAIN__SETDOMAINHANDLE     0x00008000UL
> -#define DOMAIN__SETDEBUGGING        0x00010000UL
> -#define DOMAIN__HYPERCALL           0x00020000UL
> -#define DOMAIN__SETTIME             0x00040000UL
> -#define DOMAIN__SET_TARGET          0x00080000UL
> -#define DOMAIN__SHUTDOWN            0x00100000UL
> -#define DOMAIN__SETADDRSIZE         0x00200000UL
> -#define DOMAIN__GETADDRSIZE         0x00400000UL
> -#define DOMAIN__TRIGGER             0x00800000UL
> -#define DOMAIN__GETEXTVCPUCONTEXT   0x01000000UL
> -#define DOMAIN__SETEXTVCPUCONTEXT   0x02000000UL
> -#define DOMAIN__GETVCPUEXTSTATE     0x04000000UL
> -#define DOMAIN__SETVCPUEXTSTATE     0x08000000UL
> -#define DOMAIN__GETPODTARGET        0x10000000UL
> -#define DOMAIN__SETPODTARGET        0x20000000UL
> -#define DOMAIN__SET_MISC_INFO       0x40000000UL
> -#define DOMAIN__SET_VIRQ_HANDLER    0x80000000UL
> -
> -#define DOMAIN2__RELABELFROM        0x00000001UL
> -#define DOMAIN2__RELABELTO          0x00000002UL
> -#define DOMAIN2__RELABELSELF        0x00000004UL
> -#define DOMAIN2__MAKE_PRIV_FOR      0x00000008UL
> -#define DOMAIN2__SET_AS_TARGET      0x00000010UL
> -#define DOMAIN2__SET_CPUID          0x00000020UL
> -#define DOMAIN2__GETTSC             0x00000040UL
> -#define DOMAIN2__SETTSC             0x00000080UL
> -
> -#define HVM__SETHVMC                0x00000001UL
> -#define HVM__GETHVMC                0x00000002UL
> -#define HVM__SETPARAM               0x00000004UL
> -#define HVM__GETPARAM               0x00000008UL
> -#define HVM__PCILEVEL               0x00000010UL
> -#define HVM__IRQLEVEL               0x00000020UL
> -#define HVM__PCIROUTE               0x00000040UL
> -#define HVM__BIND_IRQ               0x00000080UL
> -#define HVM__CACHEATTR              0x00000100UL
> -#define HVM__TRACKDIRTYVRAM         0x00000200UL
> -#define HVM__HVMCTL                 0x00000400UL
> -#define HVM__MEM_EVENT              0x00000800UL
> -#define HVM__MEM_SHARING            0x00001000UL
> -#define HVM__AUDIT_P2M              0x00002000UL
> -#define HVM__SEND_IRQ               0x00004000UL
> -#define HVM__SHARE_MEM              0x00008000UL
> -
> -#define EVENT__BIND                 0x00000001UL
> -#define EVENT__SEND                 0x00000002UL
> -#define EVENT__STATUS               0x00000004UL
> -#define EVENT__NOTIFY               0x00000008UL
> -#define EVENT__CREATE               0x00000010UL
> -#define EVENT__RESET                0x00000020UL
> -
> -#define GRANT__MAP_READ             0x00000001UL
> -#define GRANT__MAP_WRITE            0x00000002UL
> -#define GRANT__UNMAP                0x00000004UL
> -#define GRANT__TRANSFER             0x00000008UL
> -#define GRANT__SETUP                0x00000010UL
> -#define GRANT__COPY                 0x00000020UL
> -#define GRANT__QUERY                0x00000040UL
> -
> -#define MMU__MAP_READ               0x00000001UL
> -#define MMU__MAP_WRITE              0x00000002UL
> -#define MMU__PAGEINFO               0x00000004UL
> -#define MMU__PAGELIST               0x00000008UL
> -#define MMU__ADJUST                 0x00000010UL
> -#define MMU__STAT                   0x00000020UL
> -#define MMU__TRANSLATEGP            0x00000040UL
> -#define MMU__UPDATEMP               0x00000080UL
> -#define MMU__PHYSMAP                0x00000100UL
> -#define MMU__PINPAGE                0x00000200UL
> -#define MMU__MFNLIST                0x00000400UL
> -#define MMU__MEMORYMAP              0x00000800UL
> -#define MMU__REMOTE_REMAP           0x00001000UL
> -#define MMU__MMUEXT_OP              0x00002000UL
> -#define MMU__EXCHANGE               0x00004000UL
> -
> -#define SHADOW__DISABLE             0x00000001UL
> -#define SHADOW__ENABLE              0x00000002UL
> -#define SHADOW__LOGDIRTY            0x00000004UL
> -
> -#define RESOURCE__ADD               0x00000001UL
> -#define RESOURCE__REMOVE            0x00000002UL
> -#define RESOURCE__USE               0x00000004UL
> -#define RESOURCE__ADD_IRQ           0x00000008UL
> -#define RESOURCE__REMOVE_IRQ        0x00000010UL
> -#define RESOURCE__ADD_IOPORT        0x00000020UL
> -#define RESOURCE__REMOVE_IOPORT     0x00000040UL
> -#define RESOURCE__ADD_IOMEM         0x00000080UL
> -#define RESOURCE__REMOVE_IOMEM      0x00000100UL
> -#define RESOURCE__STAT_DEVICE       0x00000200UL
> -#define RESOURCE__ADD_DEVICE        0x00000400UL
> -#define RESOURCE__REMOVE_DEVICE     0x00000800UL
> -#define RESOURCE__PLUG              0x00001000UL
> -#define RESOURCE__UNPLUG            0x00002000UL
> -#define RESOURCE__SETUP             0x00004000UL
> -
> -#define SECURITY__COMPUTE_AV        0x00000001UL
> -#define SECURITY__COMPUTE_CREATE    0x00000002UL
> -#define SECURITY__COMPUTE_MEMBER    0x00000004UL
> -#define SECURITY__CHECK_CONTEXT     0x00000008UL
> -#define SECURITY__LOAD_POLICY       0x00000010UL
> -#define SECURITY__COMPUTE_RELABEL   0x00000020UL
> -#define SECURITY__COMPUTE_USER      0x00000040UL
> -#define SECURITY__SETENFORCE        0x00000080UL
> -#define SECURITY__SETBOOL           0x00000100UL
> -#define SECURITY__SETSECPARAM       0x00000200UL
> -#define SECURITY__ADD_OCONTEXT      0x00000400UL
> -#define SECURITY__DEL_OCONTEXT      0x00000800UL
> -
> diff --git a/xen/xsm/flask/include/class_to_string.h b/xen/xsm/flask/include/class_to_string.h
> deleted file mode 100644
> index 7716645..0000000
> --- a/xen/xsm/flask/include/class_to_string.h
> +++ /dev/null
> @@ -1,15 +0,0 @@
> -/* This file is automatically generated.  Do not edit. */
> -/*
> - * Security object class definitions
> - */
> -    S_("null")
> -    S_("xen")
> -    S_("domain")
> -    S_("domain2")
> -    S_("hvm")
> -    S_("mmu")
> -    S_("resource")
> -    S_("shadow")
> -    S_("event")
> -    S_("grant")
> -    S_("security")
> diff --git a/xen/xsm/flask/include/flask.h b/xen/xsm/flask/include/flask.h
> deleted file mode 100644
> index 3bff998..0000000
> --- a/xen/xsm/flask/include/flask.h
> +++ /dev/null
> @@ -1,35 +0,0 @@
> -/* This file is automatically generated.  Do not edit. */
> -#ifndef _SELINUX_FLASK_H_
> -#define _SELINUX_FLASK_H_
> -
> -/*
> - * Security object class definitions
> - */
> -#define SECCLASS_XEN         1
> -#define SECCLASS_DOMAIN      2
> -#define SECCLASS_DOMAIN2     3
> -#define SECCLASS_HVM         4
> -#define SECCLASS_MMU         5
> -#define SECCLASS_RESOURCE    6
> -#define SECCLASS_SHADOW      7
> -#define SECCLASS_EVENT       8
> -#define SECCLASS_GRANT       9
> -#define SECCLASS_SECURITY    10
> -
> -/*
> - * Security identifier indices for initial entities
> - */
> -#define SECINITSID_XEN       1
> -#define SECINITSID_DOM0      2
> -#define SECINITSID_DOMIO     3
> -#define SECINITSID_DOMXEN    4
> -#define SECINITSID_UNLABELED 5
> -#define SECINITSID_SECURITY  6
> -#define SECINITSID_IOPORT    7
> -#define SECINITSID_IOMEM     8
> -#define SECINITSID_IRQ       9
> -#define SECINITSID_DEVICE    10
> -
> -#define SECINITSID_NUM       10
> -
> -#endif
> diff --git a/xen/xsm/flask/include/initial_sid_to_string.h b/xen/xsm/flask/include/initial_sid_to_string.h
> deleted file mode 100644
> index 814f4bf..0000000
> --- a/xen/xsm/flask/include/initial_sid_to_string.h
> +++ /dev/null
> @@ -1,16 +0,0 @@
> -/* This file is automatically generated.  Do not edit. */
> -static char *initial_sid_to_string[] =
> -{
> -    "null",
> -    "xen",
> -    "dom0",
> -    "domio",
> -    "domxen",
> -    "unlabeled",
> -    "security",
> -    "ioport",
> -    "iomem",
> -    "irq",
> -    "device",
> -};
> -
> diff --git a/tools/flask/policy/policy/flask/access_vectors b/xen/xsm/flask/policy/access_vectors
> similarity index 100%
> rename from tools/flask/policy/policy/flask/access_vectors
> rename to xen/xsm/flask/policy/access_vectors
> diff --git a/tools/flask/policy/policy/flask/initial_sids b/xen/xsm/flask/policy/initial_sids
> similarity index 100%
> rename from tools/flask/policy/policy/flask/initial_sids
> rename to xen/xsm/flask/policy/initial_sids
> diff --git a/tools/flask/policy/policy/flask/mkaccess_vector.sh b/xen/xsm/flask/policy/mkaccess_vector.sh
> similarity index 97%
> rename from tools/flask/policy/policy/flask/mkaccess_vector.sh
> rename to xen/xsm/flask/policy/mkaccess_vector.sh
> index 43a60a7..8ec87f7 100644
> --- a/tools/flask/policy/policy/flask/mkaccess_vector.sh
> +++ b/xen/xsm/flask/policy/mkaccess_vector.sh
> @@ -9,8 +9,8 @@ awk=$1
>  shift
>  
>  # output files
> -av_permissions="av_permissions.h"
> -av_perm_to_string="av_perm_to_string.h"
> +av_permissions="include/av_permissions.h"
> +av_perm_to_string="include/av_perm_to_string.h"
>  
>  cat $* | $awk "
>  BEGIN {
> diff --git a/tools/flask/policy/policy/flask/mkflask.sh b/xen/xsm/flask/policy/mkflask.sh
> similarity index 95%
> rename from tools/flask/policy/policy/flask/mkflask.sh
> rename to xen/xsm/flask/policy/mkflask.sh
> index 9c84754..e8d8fb5 100644
> --- a/tools/flask/policy/policy/flask/mkflask.sh
> +++ b/xen/xsm/flask/policy/mkflask.sh
> @@ -9,9 +9,9 @@ awk=$1
>  shift 1
>  
>  # output file
> -output_file="flask.h"
> -debug_file="class_to_string.h"
> -debug_file2="initial_sid_to_string.h"
> +output_file="include/flask.h"
> +debug_file="include/class_to_string.h"
> +debug_file2="include/initial_sid_to_string.h"
>  
>  cat $* | $awk "
>  BEGIN {
> diff --git a/tools/flask/policy/policy/flask/security_classes b/xen/xsm/flask/policy/security_classes
> similarity index 100%
> rename from tools/flask/policy/policy/flask/security_classes
> rename to xen/xsm/flask/policy/security_classes
> -- 
> 1.7.11.4
>
Dario Faggioli
2012-Oct-10 08:44 UTC
Re: [PATCH RFC] flask: move policy header sources into hypervisor
Hello Daniel,

On Tue, 2012-10-09 at 14:31 -0400, Daniel De Graaf wrote:
> Ian Campbell wrote:
> [...]
> >>> +++ b/xen/xsm/flask/include/av_perm_to_string.h
> > Also, in that case why is this file checked in?
>
> This patch fixes the autogenerated files, but doesn't fully wire them in
> to things like "make clean" or .{git,hg}ignore.
>
Forgive me for pushing but, while you're here, do you mind taking a look
and sharing your thoughts about the hunks of the patch touching XSM and
FLASK? As I said, I've very little experience with that part of Xen, and
it would be wonderful to know whether what I did looks sane, or whether
I messed something up! :-P

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dario Faggioli
2012-Oct-10 08:46 UTC
Re: [PATCH 8 of 8] xl: add node-affinity to the output of `xl list`
On Tue, 2012-10-09 at 16:03 +0100, Ian Jackson wrote:
> > Which is what made me thinking that opacity was not its first concern in
> > the first place, and that turning it into being opaque was none of this
> > change's business. :-)
>
> You are right that since you're just moving the code, it's not a
> problem for this patch.
>
Ok.

> > However, I see your point... Perhaps I can add two functions (something
> > like print_{cpumap,nodemap}()), both calling the original
> > print_bitmap(), and deal with the "any {cpu,node}" case within them...
> >
> > Do you like that better?
>
> That would indeed be an improvement.
>
I think I'll go for it then. It's a small effort, and I think the final
results would be better too.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-10 10:55 UTC
Re: [PATCH 7 of 8] libxl: automatic placement deals with node-affinity
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Which basically means the following two things:
>  1) during domain creation, it is the node-affinity of
>     the domain --rather than the vcpu-affinities of its
>     vcpus-- that is affected by automatic placement;
>  2) during automatic placement, when counting how many
>     vcpus are already "bound" to a placement candidate
>     (as part of the process of choosing the best
>     candidate), node-affinity is also considered,
>     together with vcpu-affinity.
>
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

>
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc *
>  {
>      int found;
>      libxl__numa_candidate candidate;
> -    libxl_bitmap candidate_nodemap;
> +    libxl_bitmap cpupool_nodemap;
>      libxl_cpupoolinfo cpupool_info;
>      int i, cpupool, rc = 0;
>      uint32_t memkb;
> 
>      libxl__numa_candidate_init(&candidate);
> -    libxl_bitmap_init(&candidate_nodemap);
> +    libxl_bitmap_init(&cpupool_nodemap);
> 
>      /*
>       * Extract the cpumap from the cpupool the domain belong to. In fact,
> @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc *
>      rc = libxl_domain_need_memory(CTX, info, &memkb);
>      if (rc)
>          goto out;
> -    if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) {
> +    if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) {
>          rc = ERROR_FAIL;
>          goto out;
>      }
> @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc *
>      if (found == 0)
>          goto out;
> 
> -    /* Map the candidate's node map to the domain's info->cpumap */
> -    libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap);
> -    rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap);
> +    /* Map the candidate's node map to the domain's info->nodemap */
> +    libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap);
> +
> +    /* Avoid trying to set the affinity to nodes that might be in the
> +     * candidate's nodemap but out of our cpupool. */
> +    rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap,
> +                                 &cpupool_nodemap);
>      if (rc)
>          goto out;
> 
> -    /* Avoid trying to set the affinity to cpus that might be in the
> -     * nodemap but not in our cpupool. */
> -    libxl_for_each_set_bit(i, info->cpumap) {
> -        if (!libxl_bitmap_test(&cpupool_info.cpumap, i))
> -            libxl_bitmap_reset(&info->cpumap, i);
> +    libxl_for_each_set_bit(i, info->nodemap) {
> +        if (!libxl_bitmap_test(&cpupool_nodemap, i))
> +            libxl_bitmap_reset(&info->nodemap, i);
>      }
> 
>      LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and "
> @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc *
> 
>  out:
>      libxl__numa_candidate_dispose(&candidate);
> -    libxl_bitmap_dispose(&candidate_nodemap);
> +    libxl_bitmap_dispose(&cpupool_nodemap);
>      libxl_cpupoolinfo_dispose(&cpupool_info);
>      return rc;
>  }
> @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint
>      /*
>       * Check if the domain has any CPU affinity. If not, try to build
>       * up one. In case numa_place_domain() find at least a suitable
> -     * candidate, it will affect info->cpumap accordingly; if it
> +     * candidate, it will affect info->nodemap accordingly; if it
>       * does not, it just leaves it as it is. This means (unless
>       * some weird error manifests) the subsequent call to
> -     * libxl_set_vcpuaffinity_all() will do the actual placement,
> +     * libxl_domain_set_nodeaffinity() will do the actual placement,
>       * whatever that turns out to be.
>       */
>      if (libxl_defbool_val(info->numa_placement)) {
> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c
> --- a/tools/libxl/libxl_numa.c
> +++ b/tools/libxl/libxl_numa.c
> @@ -171,7 +171,7 @@ static int nodemap_to_nr_vcpus(libxl__gc
>                                 const libxl_bitmap *nodemap)
>  {
>      libxl_dominfo *dinfo = NULL;
> -    libxl_bitmap vcpu_nodemap;
> +    libxl_bitmap vcpu_nodemap, dom_nodemap;
>      int nr_doms, nr_cpus;
>      int nr_vcpus = 0;
>      int i, j, k;
> @@ -185,6 +185,12 @@ static int nodemap_to_nr_vcpus(libxl__gc
>          return ERROR_FAIL;
>      }
> 
> +    if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) {
> +        libxl_dominfo_list_free(dinfo, nr_doms);
> +        libxl_bitmap_dispose(&vcpu_nodemap);
> +        return ERROR_FAIL;
> +    }
> +
>      for (i = 0; i < nr_doms; i++) {
>          libxl_vcpuinfo *vinfo;
>          int nr_dom_vcpus;
> @@ -193,6 +199,9 @@ static int nodemap_to_nr_vcpus(libxl__gc
>          if (vinfo == NULL)
>              continue;
> 
> +        /* Retrieve the domain's node-affinity map (see below) */
> +        libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap);
> +
>          /* For each vcpu of each domain ... */
>          for (j = 0; j < nr_dom_vcpus; j++) {
> 
> @@ -201,9 +210,17 @@ static int nodemap_to_nr_vcpus(libxl__gc
>              libxl_for_each_set_bit(k, vinfo[j].cpumap)
>                  libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node);
> 
> -            /* And check if that map has any intersection with our nodemap */
> +            /*
> +             * We now check whether the && of the vcpu's nodemap and the
> +             * domain's nodemap has any intersection with the nodemap of our
> +             * candidate.
> +             * Using both (vcpu's and domain's) nodemaps allows us to take
> +             * both vcpu-affinity and node-affinity into account when counting
> +             * the number of vcpus bound to the candidate.
> +             */
>              libxl_for_each_set_bit(k, vcpu_nodemap) {
> -                if (libxl_bitmap_test(nodemap, k)) {
> +                if (libxl_bitmap_test(&dom_nodemap, k) &&
> +                    libxl_bitmap_test(nodemap, k)) {
>                      nr_vcpus++;
>                      break;
>                  }
> @@ -213,6 +230,7 @@ static int nodemap_to_nr_vcpus(libxl__gc
>          libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus);
>      }
> 
> +    libxl_bitmap_dispose(&dom_nodemap);
>      libxl_bitmap_dispose(&vcpu_nodemap);
>      libxl_dominfo_list_free(dinfo, nr_doms);
>      return nr_vcpus;
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
George Dunlap
2012-Oct-10 11:00 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
<dario.faggioli@citrix.com> wrote:
> Hi Everyone,
>
> Here it comes a patch series instilling some NUMA awareness in the Credit
> scheduler.

Hey Dario -- I've looked through everything and acked everything I
felt I understood well enough / had the authority to ack. Thanks for
the good work!

 -George

>
> What the patches do is teaching the Xen's scheduler how to try maximizing
> performances on a NUMA host, taking advantage of the information coming from
> the automatic NUMA placement we have in libxl. Right now, the
> placement algorithm runs and selects a node (or a set of nodes) where it is best
> to put a new domain on. Then, all the memory for the new domain is allocated
> from those node(s) and all the vCPUs of the new domain are pinned to the pCPUs
> of those node(s). What we do here is, instead of statically pinning the domain's
> vCPUs to the nodes' pCPUs, have the (Credit) scheduler _prefer_ running them
> there. That enables most of the performances benefits of "real" pinning, but
> without its intrinsic lack of flexibility.
>
> The above happens by extending to the scheduler the knowledge of a domain's
> node-affinity. We then ask it to first try to run the domain's vCPUs on one of
> the nodes the domain has affinity with. Of course, if that turns out to be
> impossible, it falls back on the old behaviour (i.e., considering vcpu-affinity
> only).
>
> Just allow me to mention that NUMA aware scheduling not only is one of the item
> of the NUMA roadmap I'm trying to maintain here
> http://wiki.xen.org/wiki/Xen_NUMA_Roadmap. It is also one of the features we
> decided we want for Xen 4.3 (and thus it is part of the list of such features
> that George is maintaining).
>
> Up to now, I've been able to thoroughly test this only on my 2 NUMA nodes
> testbox, by running the SpecJBB2005 benchmark concurrently on multiple VMs, and
> the results looks really nice. A full set of what I got can be found inside my
> presentation from last XenSummit, which is available here:
>
> http://www.slideshare.net/xen_com_mgr/numa-and-virtualization-the-case-of-xen?ref=http://www.xen.org/xensummit/xs12na_talks/T9.html
>
> However, I rerun some of the tests in these last days (since I changed some
> bits of the implementation) and here's what I got:
>
> -------------------------------------------------------
>  SpecJBB2005 Total Aggregate Throughput
> -------------------------------------------------------
> #VMs   No NUMA affinity    NUMA affinity &     +/- %
>                            scheduling
> -------------------------------------------------------
>    2      34653.273          40243.015        +16.13%
>    4      29883.057          35526.807        +18.88%
>    6      23512.926          27015.786        +14.89%
>    8      19120.243          21825.818        +14.15%
>   10      15676.675          17701.472        +12.91%
>
> Basically, results are consistent with what is shown in the super-nice graphs I
> have in the slides above! :-) As said, this looks nice to me, especially
> considering that my test machine is quite small, i.e., its 2 nodes are very
> close to each others from a latency point of view. I really expect more
> improvement on bigger hardware, where much greater NUMA effect is to be
> expected. Of course, I myself will continue benchmarking (hopefully, on
> systems with more than 2 nodes too), but should anyone want to run its own
> testing, that would be great, so feel free to do that and report results to me
> and/or to the list!
>
> A little bit more about the series:
>
>  1/8 xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>  2/8 xen, libxc: introduce node maps and masks
>
> Is some preparation work.
>
>  3/8 xen: let the (credit) scheduler know about `node affinity`
>
> Is where the vcpu load balancing logic of the credit scheduler is modified to
> support node-affinity.
>
>  4/8 xen: allow for explicitly specifying node-affinity
>  5/8 libxc: allow for explicitly specifying node-affinity
>  6/8 libxl: allow for explicitly specifying node-affinity
>  7/8 libxl: automatic placement deals with node-affinity
>
> Is what wires the in-scheduler node-affinity support with the external world.
> Please, note that patch 4 touches XSM and Flask, which is the area with which I
> have less experience and less chance to test properly. So, if Daniel and/or
> anyone interested in that could take a look and comment, that would be awesome.
>
>  8/8 xl: report node-affinity for domains
>
> Is just some small output enhancement.
>
> Thanks and Regards,
> Dario
>
> --
> <<This happens because I choose it to happen!>> (Raistlin Majere)
> -----------------------------------------------------------------
> Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
> Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
Dario Faggioli
2012-Oct-10 12:28 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Wed, 2012-10-10 at 12:00 +0100, George Dunlap wrote:
> On Fri, Oct 5, 2012 at 3:08 PM, Dario Faggioli
> <dario.faggioli@citrix.com> wrote:
> > Hi Everyone,
> >
> > Here it comes a patch series instilling some NUMA awareness in the Credit
> > scheduler.
>
> Hey Dario --
>
Hi!

> I've looked through everything and acked everything I
> felt I understood well enough / had the authority to ack.
>
Yep, I've seen that. Thanks.

> Thanks for
> the good work!
>
Well, thanks to you for the good comments... And be prepared for next
round! :-P

Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Daniel De Graaf
2012-Oct-10 14:03 UTC
Re: [PATCH RFC] flask: move policy header sources into hypervisor
On 10/10/2012 04:44 AM, Dario Faggioli wrote:
> Hello Daniel,
>
> On Tue, 2012-10-09 at 14:31 -0400, Daniel De Graaf wrote:
>> Ian Campbell wrote:
>> [...]
>>>>> +++ b/xen/xsm/flask/include/av_perm_to_string.h
>>> Also, in that case why is this file checked in?
>>
>> This patch fixes the autogenerated files, but doesn't fully wire them in
>> to things like "make clean" or .{git,hg}ignore.
>>
> Forgive me for pushing but, while you're here, do you mind taking a look
> and sharing your thoughts about the hunks of the patch touching XSM and
> FLASK? As I said, I've very little experience with that part of Xen, and
> it would be wonderful to know whether what I did looks sane, or whether
> I messed something up! :-P
>
> Thanks and Regards,
> Dario
>

Ah, in my distraction with fixing the autogeneration I neglected to
finish looking at the original patch. The XSM changes look good except
for a missing implementation of the dummy_nodeaffinity() function in
xen/xsm/dummy.c. However, since the implementations of xsm_nodeaffinity
and xsm_vcpuaffinity are identical, it may be simpler to just merge them
into a common xsm_affinity_domctl hook (as is implemented in
xsm/flask/hooks.c) - in that case, just renaming the existing dummy hook
will suffice.

A more general note on the topic of what XSM permissions to use:
normally, each domctl has its own permission, and so adding new domctls
would be done by adding a new permission to the access_vectors file
(which is the source of av_perm_to_string.h). However, for this case, it
seems rather unlikely that one would want to allow access to vcpu
affinity and deny node affinity, so using the same permission for both
accesses is the best solution.

When renaming a permission (such as getvcpuaffinity => getaffinity), the
FLASK policy also needs to be changed - you can normally just grep for
the permission being changed.

The dummy hook would be caught in a compilation with XSM enabled, but I
notice that current xen-unstable will not build due to a patch being
applied out of order (xsm/flask: add domain relabel support requires
rcu_lock_domain_by_any_id, which was added in the prior patch). Adding
Keir to CC since he applied the patch.

-- 
Daniel De Graaf
National Security Agency
Dario Faggioli
2012-Oct-10 14:39 UTC
Re: [PATCH RFC] flask: move policy header sources into hypervisor
On Wed, 2012-10-10 at 15:03 +0100, Daniel De Graaf wrote:
> Ah, in my distraction with fixing the autogeneration I neglected to
> finish looking at the original patch.
>
:-)

> The XSM changes look good except
> for a missing implementation of the dummy_nodeaffinity() function in
> xen/xsm/dummy.c. However, since the implementation of xsm_nodeaffinity
> and xsm_vcpuaffinity are identical, it may be simpler to just merge them
> into a common xsm_affinity_domctl hook (as is implemented in
> xsm/flask/hooks.c) - in that case, just renaming the existing dummy hook
> will suffice.
>
Ok, thanks. I will do that.

> A more general note on the topic of what XSM permissions to use:
> normally, each domctl has its own permission, and so adding new domctls
> would be done by adding a new permission to the access_vectors file
> (which is the source of av_perm_to_string.h). However, for this case, it
> seems rather unlikely that one would want to allow access to vcpu
> affinity and deny node affinity, so using the same permission for both
> accesses is the best solution.
>
Yes, exactly.

Moreover, looking at xen/xsm/flask/include/av_permissions.h where
DOMAIN__{GET,SET}VCPUAFFINITY are, I got the impression that there is
no more space left for DOMAIN__* permissions, as they already go from
0x00000001UL to 0x80000000UL... Is that so?

> When renaming a permission (such as getvcpuaffinity => getaffinity), the
> FLASK policy also needs to be changed - you can normally just grep for
> the permission being changed.
>
Ok and thanks again. I will do that too...

> The dummy hook would be caught in a compilation with XSM enabled, but I
> notice that current xen-unstable will not build due to a patch being
> applied out of order (xsm/flask: add domain relabel support requires
> rcu_lock_domain_by_any_id which was added in the prior patch). Adding
> Keir to CC since he applied the patch.
>
... As well as I will try to check for this for next round (hoping that
by that time the issue you're describing here would be fixed :-)).

Thanks a lot and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Daniel De Graaf
2012-Oct-10 15:32 UTC
Re: [PATCH RFC] flask: move policy header sources into hypervisor
On 10/10/2012 10:39 AM, Dario Faggioli wrote:
[...]
>> A more general note on the topic of what XSM permissions to use:
>> normally, each domctl has its own permission, and so adding new domctls
>> would be done by adding a new permission to the access_vectors file
>> (which is the source of av_perm_to_string.h). However, for this case, it
>> seems rather unlikely that one would want to allow access to vcpu
>> affinity and deny node affinity, so using the same permission for both
>> accesses is the best solution.
>>
> Yes, exactly.
>
> Moreover, looking at xen/xsm/flask/include/av_permissions.h where
> DOMAIN__{GET,SET}VCPUAFFINITY are, I got the impression that there is
> no more space left for DOMAIN__* permissions, as they already go from
> 0x00000001UL to 0x80000000UL... Is that so?

Yes. My XSM patch series expands this by adding SECCLASS_DOMAIN2 to
address this (and that part is already in 4.3). This solution can be
applied to any XSM classes needing more than 32 permission bits.

-- 
Daniel De Graaf
National Security Agency
Dario Faggioli
2012-Oct-10 16:18 UTC
Re: [PATCH 0 of 8] NUMA Awareness for the Credit Scheduler
On Mon, 2012-10-08 at 12:43 -0700, Dan Magenheimer wrote:
> Just wondering... is the NUMA information preserved on live migration?
> I'm not saying that it necessarily should, but it may just work
> due to the implementation (since migration is a form of domain creation).
> In either case, it might be good to comment about live migration
> on your wiki.
>
FYI:

http://wiki.xen.org/wiki/Xen_NUMA_Introduction
http://wiki.xen.org/wiki?title=Xen_NUMA_Introduction&diff=5327&oldid=4598

As per the NUMA roadmap ( http://wiki.xen.org/wiki/Xen_NUMA_Roadmap ),
you can see here [1] that it was already there. :-)

Regards,
Dario

[1] http://wiki.xen.org/wiki/Xen_NUMA_Roadmap#Virtual_NUMA_topology_exposure_to_guests

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel