thr3ads.net - Xen devel - [PATCH 00 of 10 [RFC]] Automatically place guest on host''s NUMA nodes with xl [Apr 2012]

If this information is useful, please help other people find it:
Share via:

Dario Faggioli

2012-Apr-11 13:17 UTC

[PATCH 00 of 10 [RFC]] Automatically place guest on host''s NUMA nodes with xl

Hello Everybody,

This is the first take of the automatic placement of a guest on the host NUMA
nodes I''ve been working on for a while. It, right now, takes into
account the
amount of memory the guest needs as compared to the amount of free memory on
the various nodes.

It''s still in [RFC] status, as there are still quite a bit of design
choices
I''d like to discuss, and quite a bit of changes I''ve made on
which I''d really
like to have a second opinion. :-P

Just very quickly, these are refactorings of existing data structures and code,
paving the way for the real "meat":

 1 of 10  libxc: Generalize xenctl_cpumap to just xenctl_map
 2 of 10  libxl: Generalize libxl_cpumap to just libxl_map
 3 of 10  libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap
 4 of 10  libxl: Introduce libxl_get_numainfo() calling xc_numainfo()


These enable NUMA affinity to be eplicitly specified with xl, both via
config file and command line:

 5 of 10  xl: Explicit node affinity specification for guests via config file
 6 of 10  xl: Allow user to set or change node affinity on-line


And this is where the fun happens, as these patches contain the core of the
automatic placement logic and the modifications to the (credit) scheduler
needed for taking NUMA node affinity into account:

 7 of 10  sched_credit: Let the scheduler know about `node affinity`
 8 of 10  xl: Introduce First Fit memory-wise placement of guests on nodes
 9 of 10  xl: Introduce Best and Worst Fit guest placement algorithms

Finally, here it comes some rationale and user-level oriented documentation:
 10 of 10 xl: Some automatic NUMA placement documentation


Some of the changelogs contain a TODO list, with stuff that need to be
considered, thought about, or just added, perhaps in the next version of the
series. Also, the various patches have quite a bit of ''XXX''
marked code
comments, to better highlight the spots where I think I might have done
something scary, or where I would like discussion the to concentrate on.
Providing any kind of feedback about these design and coding decisions (I mean
the TODOs and XXXs) I made, is really going to be of great help to me! :-)

As per the timing... I know we''re in feature freeze, and I
don''t see much about
the issue this series tackles in the release plan. So, I''ll be more
than happy
if (even if just some of) the patches becomes 4.2 material, and I can commit on
giving them as much testing and benchmarking as possible, but I understand if
this is judged to be too immature for being considered.

I did some benchmarking about the current performances of Xen on a (small) NUMA
machine, and you can see them here:

 http://xenbits.xen.org/people/dariof/benchmarks/specjbb2005/

This is _before_ any of these patches, just xen-unstable with plain
vcpu-pinning.  As changelogs say, I''m benchmarking the various features
this
series introduces as well, and I''ll share the results as soon as
they''ll be
ready.

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-------------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

In preparation for adding an xc_nodemap_t and its related
hadling logic. Just add one indirection layer, and try to
retain the interface as much as possible (although some
small bits here and there have been affected).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com>

diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
--- a/tools/libxc/xc_cpupool.c
+++ b/tools/libxc/xc_cpupool.c
@@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
     sysctl.u.cpupool_op.cpupool_id = poolid;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
@@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
     sysctl.cmd = XEN_SYSCTL_cpupool_op;
     sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
     set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
-    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
+    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
 
     err = do_sysctl_save(xch, &sysctl);
 
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
 
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
@@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
     domctl.u.vcpuaffinity.vcpu = vcpu;
 
     set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
-    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
+    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
 
     ret = do_domctl(xch, &domctl);
 
diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
--- a/tools/libxc/xc_tbuf.c
+++ b/tools/libxc/xc_tbuf.c
@@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x
     bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8);
 
     set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap);
-    sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8;
+    sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8;
 
     ret = do_sysctl(xch, &sysctl);
 
diff --git a/xen/arch/x86/platform_hypercall.c
b/xen/arch/x86/platform_hypercall.c
--- a/xen/arch/x86/platform_hypercall.c
+++ b/xen/arch/x86/platform_hypercall.c
@@ -365,7 +365,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe
     {
         uint32_t cpu;
         uint64_t idletime, now = NOW();
-        struct xenctl_cpumap ctlmap;
+        struct xenctl_map ctlmap;
         cpumask_var_t cpumap;
         XEN_GUEST_HANDLE(uint8) cpumap_bitmap;
         XEN_GUEST_HANDLE(uint64) idletimes;
@@ -378,7 +378,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe
         if ( cpufreq_controller != FREQCTL_dom0_kernel )
             break;
 
-        ctlmap.nr_cpus  = op->u.getidletime.cpumap_nr_cpus;
+        ctlmap.nr_elems  = op->u.getidletime.cpumap_nr_cpus;
         guest_from_compat_handle(cpumap_bitmap,
                                  op->u.getidletime.cpumap_bitmap);
         ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion
*/
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -31,28 +31,29 @@
 static DEFINE_SPINLOCK(domctl_lock);
 DEFINE_SPINLOCK(vcpu_alloc_lock);
 
-int cpumask_to_xenctl_cpumap(
-    struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask)
+int bitmap_to_xenctl_map(struct xenctl_map *xenctl_map,
+                         const unsigned long *bitmap,
+                         unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes, i;
     uint8_t zero = 0;
     int err = 0;
-    uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_map->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
-    bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids);
+    bitmap_long_to_byte(bytemap, bitmap, nbits);
 
     if ( copy_bytes != 0 )
-        if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) )
+        if ( copy_to_guest(xenctl_map->bitmap, bytemap, copy_bytes) )
             err = -EFAULT;
 
     for ( i = copy_bytes; !err && i < guest_bytes; i++ )
-        if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) )
+        if ( copy_to_guest_offset(xenctl_map->bitmap, i, &zero, 1) )
             err = -EFAULT;
 
     xfree(bytemap);
@@ -60,36 +61,58 @@ int cpumask_to_xenctl_cpumap(
     return err;
 }
 
-int xenctl_cpumap_to_cpumask(
-    cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap)
+int xenctl_map_to_bitmap(unsigned long *bitmap,
+                         const struct xenctl_map *xenctl_map,
+                         unsigned int nbits)
 {
     unsigned int guest_bytes, copy_bytes;
     int err = 0;
-    uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
+    uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8);
 
     if ( !bytemap )
         return -ENOMEM;
 
-    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
-    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
+    guest_bytes = (xenctl_map->nr_elems + 7) / 8;
+    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
 
     if ( copy_bytes != 0 )
     {
-        if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) )
+        if ( copy_from_guest(bytemap, xenctl_map->bitmap, copy_bytes) )
             err = -EFAULT;
-        if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes <=
sizeof(bytemap)) )
-            bytemap[guest_bytes-1] &= ~(0xff <<
(xenctl_cpumap->nr_cpus & 7));
+        if ( (xenctl_map->nr_elems & 7) && (guest_bytes <=
sizeof(bytemap)) )
+            bytemap[guest_bytes-1] &= ~(0xff <<
(xenctl_map->nr_elems & 7));
     }
 
-    if ( err )
-        /* nothing */;
-    else if ( alloc_cpumask_var(cpumask) )
-        bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids);
+    if ( !err )
+        bitmap_byte_to_long(bitmap, bytemap, nbits);
+
+    xfree(bytemap);
+
+    return err;
+}
+
+int cpumask_to_xenctl_cpumap(struct xenctl_map *xenctl_cpumap,
+                             const cpumask_t *cpumask)
+{
+    return bitmap_to_xenctl_map(xenctl_cpumap, cpumask_bits(cpumask),
+                                nr_cpu_ids);
+}
+
+int xenctl_cpumap_to_cpumask(cpumask_var_t *cpumask,
+                             const struct xenctl_map *xenctl_cpumap)
+{
+    int err = 0;
+
+    if ( alloc_cpumask_var(cpumask) ) {
+        err = xenctl_map_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap,
+                                   nr_cpu_ids);
+        /* In case of error, cleanup is up to us, as the caller won''t
care! */
+        if ( err )
+            free_cpumask_var(*cpumask);
+    }
     else
         err = -ENOMEM;
 
-    xfree(bytemap);
-
     return err;
 }
 
diff --git a/xen/include/public/arch-x86/xen-mca.h
b/xen/include/public/arch-x86/xen-mca.h
--- a/xen/include/public/arch-x86/xen-mca.h
+++ b/xen/include/public/arch-x86/xen-mca.h
@@ -414,7 +414,7 @@ struct xen_mc_mceinject {
 
 struct xen_mc_inject_v2 {
 	uint32_t flags;
-	struct xenctl_cpumap cpumap;
+	struct xenctl_map cpumap;
 };
 #endif
 
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -283,7 +283,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc
 /* XEN_DOMCTL_getvcpuaffinity */
 struct xen_domctl_vcpuaffinity {
     uint32_t  vcpu;              /* IN */
-    struct xenctl_cpumap cpumap; /* IN/OUT */
+    struct xenctl_map cpumap;    /* IN/OUT */
 };
 typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t);
diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
--- a/xen/include/public/sysctl.h
+++ b/xen/include/public/sysctl.h
@@ -71,8 +71,8 @@ struct xen_sysctl_tbuf_op {
 #define XEN_SYSCTL_TBUFOP_disable      5
     uint32_t cmd;
     /* IN/OUT variables */
-    struct xenctl_cpumap cpu_mask;
-    uint32_t             evt_mask;
+    struct xenctl_map cpu_mask;
+    uint32_t          evt_mask;
     /* OUT variables */
     uint64_aligned_t buffer_mfn;
     uint32_t size;  /* Also an IN variable! */
@@ -531,7 +531,7 @@ struct xen_sysctl_cpupool_op {
     uint32_t domid;       /* IN: M              */
     uint32_t cpu;         /* IN: AR             */
     uint32_t n_dom;       /*            OUT: I  */
-    struct xenctl_cpumap cpumap; /*     OUT: IF */
+    struct xenctl_map cpumap;    /*     OUT: IF */
 };
 typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t;
 DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t);
diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
--- a/xen/include/public/xen.h
+++ b/xen/include/public/xen.h
@@ -822,9 +822,9 @@ typedef uint8_t xen_domain_handle_t[16];
 #endif
 
 #ifndef __ASSEMBLY__
-struct xenctl_cpumap {
+struct xenctl_map {
     XEN_GUEST_HANDLE_64(uint8) bitmap;
-    uint32_t nr_cpus;
+    uint32_t nr_elems;
 };
 #endif
 
diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
--- a/xen/include/xen/cpumask.h
+++ b/xen/include/xen/cpumask.h
@@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
 #define for_each_present_cpu(cpu)  for_each_cpu(cpu, &cpu_present_map)
 
 /* Copy to/from cpumap provided by control tools. */
-struct xenctl_cpumap;
-int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
-int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *);
+struct xenctl_map;
+int cpumask_to_xenctl_cpumap(struct xenctl_map *, const cpumask_t *);
+int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_map *);
 
 #endif /* __XEN_CPUMASK_H */
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -2,7 +2,7 @@
 # ! - needs translation
 # ? - needs checking
 ?	dom0_vga_console_info		xen.h
-?	xenctl_cpumap			xen.h
+?	xenctl_map			xen.h
 ?	mmu_update			xen.h
 !	mmuext_op			xen.h
 !	start_info			xen.h

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 02 of 10 [RFC]] libxl: Generalize libxl_cpumap to just libxl_map

In preparation for adding a libxl_nodemap and its related
hadling logic. No changes to the interface this time.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.eu.com>

diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -277,11 +277,23 @@ typedef uint32_t libxl_hwcap[8];
 
 typedef uint64_t libxl_ev_user;
 
-typedef struct {
+struct libxl_map {
     uint32_t size;          /* number of bytes in map */
     uint8_t *map;
-} libxl_cpumap;
-void libxl_cpumap_dispose(libxl_cpumap *map);
+};
+void libxl_map_dispose(struct libxl_map *map);
+
+typedef struct libxl_map libxl_cpumap;
+static inline void libxl_cpumap_dispose(libxl_cpumap *cpumap)
+{
+    return libxl_map_dispose(cpumap);
+}
+
+typedef struct libxl_map libxl_nodemap;
+static inline void libxl_nodemap_dispose(libxl_nodemap *nodemap)
+{
+    return libxl_map_dispose(nodemap);
+}
 
 typedef struct {
     /*
@@ -474,6 +486,9 @@ int libxl_domain_preserve(libxl_ctx *ctx
 /* get max. number of cpus supported by hypervisor */
 int libxl_get_max_cpus(libxl_ctx *ctx);
 
+/* get max. number of NUMA nodes supported by hypervisor */
+int libxl_get_max_nodes(libxl_ctx *ctx);
+
 /*
  * Run the configured bootloader for a PV domain and update
  * info->kernel, info->u.pv.ramdisk and info->u.pv.cmdline as
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -437,47 +437,53 @@ int libxl_mac_to_device_nic(libxl_ctx *c
     return rc;
 }
 
+void libxl_map_dispose(struct libxl_map *map)
+{
+    free(map->map);
+}
+
+static int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems)
+{
+    int sz;
+
+    sz = (n_elems + 7) / 8;
+    map->map = calloc(sz, sizeof(*map->map));
+    if (!map->map)
+        return ERROR_NOMEM;
+    map->size = sz;
+    return 0;
+}
+
+int libxl_map_test(struct libxl_map *map, int elem)
+{
+    if (elem >= map->size * 8)
+        return 0;
+    return (map->map[elem / 8] & (1 << (elem & 7))) ? 1 : 0;
+}
+
+void libxl_map_set(struct libxl_map *map, int elem)
+{
+    if (elem >= map->size * 8)
+        return;
+    map->map[elem / 8] |= 1 << (elem & 7);
+}
+
+void libxl_map_reset(struct libxl_map *map, int elem)
+{
+    if (elem >= map->size * 8)
+        return;
+    map->map[elem / 8] &= ~(1 << (elem & 7));
+}
+
 int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap)
 {
     int max_cpus;
-    int sz;
 
     max_cpus = libxl_get_max_cpus(ctx);
     if (max_cpus == 0)
         return ERROR_FAIL;
 
-    sz = (max_cpus + 7) / 8;
-    cpumap->map = calloc(sz, sizeof(*cpumap->map));
-    if (!cpumap->map)
-        return ERROR_NOMEM;
-    cpumap->size = sz;
-    return 0;
-}
-
-void libxl_cpumap_dispose(libxl_cpumap *map)
-{
-    free(map->map);
-}
-
-int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu)
-{
-    if (cpu >= cpumap->size * 8)
-        return 0;
-    return (cpumap->map[cpu / 8] & (1 << (cpu & 7))) ? 1 : 0;
-}
-
-void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu)
-{
-    if (cpu >= cpumap->size * 8)
-        return;
-    cpumap->map[cpu / 8] |= 1 << (cpu & 7);
-}
-
-void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu)
-{
-    if (cpu >= cpumap->size * 8)
-        return;
-    cpumap->map[cpu / 8] &= ~(1 << (cpu & 7));
+    return libxl_map_alloc(ctx, cpumap, max_cpus);
 }
 
 int libxl_get_max_cpus(libxl_ctx *ctx)
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -64,21 +64,46 @@ int libxl_devid_to_device_nic(libxl_ctx 
 int libxl_vdev_to_device_disk(libxl_ctx *ctx, uint32_t domid, const char *vdev,
                                libxl_device_disk *disk);
 
+int libxl_map_test(struct libxl_map *map, int elem);
+void libxl_map_set(struct libxl_map *map, int elem);
+void libxl_map_reset(struct libxl_map *map, int elem);
+static inline void libxl_map_set_any(struct libxl_map *map)
+{
+    memset(map->map, -1, map->size);
+}
+static inline void libxl_map_set_none(struct libxl_map *map)
+{
+    memset(map->map, 0, map->size);
+}
+static inline int libxl_map_elem_valid(struct libxl_map *map, int elem)
+{
+    return elem >= 0 && elem < (map->size * 8);
+}
+
 int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap);
-int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu);
-void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu);
-void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu);
+static inline int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu)
+{
+    return libxl_map_test(cpumap, cpu);
+}
+static inline void libxl_cpumap_set(libxl_cpumap *cpumap, int cpu)
+{
+    libxl_map_set(cpumap, cpu);
+}
+static inline void libxl_cpumap_reset(libxl_cpumap *cpumap, int cpu)
+{
+    libxl_map_reset(cpumap, cpu);
+}
 static inline void libxl_cpumap_set_any(libxl_cpumap *cpumap)
 {
-    memset(cpumap->map, -1, cpumap->size);
+    libxl_map_set_any(cpumap);
 }
 static inline void libxl_cpumap_set_none(libxl_cpumap *cpumap)
 {
-    memset(cpumap->map, 0, cpumap->size);
+    libxl_map_set_none(cpumap);
 }
 static inline int libxl_cpumap_cpu_valid(libxl_cpumap *cpumap, int cpu)
 {
-    return cpu >= 0 && cpu < (cpumap->size * 8);
+    return libxl_map_elem_valid(cpumap, cpu);
 }
 #define libxl_for_each_cpu(var, map) for (var = 0; var < (map).size * 8;
var++)
 #define libxl_for_each_set_cpu(v, m) for (v = 0; v < (m).size * 8; v++) \

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

Exactly as we have xc_cpumap_t, something similar for representing
NUMA nodes (i.e., xc_nodemap_t and nodemap_t) could be useful. The
same applies to libxl_cpumap, which on its turn now has its
node-related counterpart, libxl_nodemap.

This is all in preparation for making it possible explicit
specification of the `node affinity` of a guest.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
--- a/tools/libxc/xc_misc.c
+++ b/tools/libxc/xc_misc.c
@@ -35,21 +35,49 @@ int xc_get_max_cpus(xc_interface *xch)
     return max_cpus;
 }
 
+int xc_get_max_nodes(xc_interface *xch)
+{
+    static int max_nodes = 0;
+    xc_physinfo_t physinfo;
+
+    if ( max_nodes )
+        return max_nodes;
+
+    if ( !xc_physinfo(xch, &physinfo) )
+        max_nodes = physinfo.max_node_id + 1;
+
+    return max_nodes;
+}
+
 int xc_get_cpumap_size(xc_interface *xch)
 {
     return (xc_get_max_cpus(xch) + 7) / 8;
 }
 
-xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
+int xc_get_nodemap_size(xc_interface *xch)
 {
-    int sz;
+    return (xc_get_max_nodes(xch) + 7) / 8;
+}
 
-    sz = xc_get_cpumap_size(xch);
+/* XXX See [*] below ... */
+static xc_cpumap_t __xc_map_alloc(xc_interface *xch, int sz)
+{
     if (sz == 0)
         return NULL;
     return calloc(1, sz);
 }
 
+xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
+{
+    return __xc_map_alloc(xch, xc_get_cpumap_size(xch));
+}
+
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
+{
+    /* XXX [*] How bad is this? Ideas? */
+    return (xc_nodemap_t) __xc_map_alloc(xch, xc_get_nodemap_size(xch));
+}
+
 int xc_readconsolering(xc_interface *xch,
                        char *buffer,
                        unsigned int *pnr_chars,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -330,6 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
 xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
 
 /*
+ * NODEMAP handling
+ */
+typedef uint8_t *xc_nodemap_t;
+
+/* return maximum number of NUMA nodes the hypervisor supports */
+int xc_get_max_nodes(xc_interface *xch);
+
+/* return array size for nodemap */
+int xc_get_nodemap_size(xc_interface *xch);
+
+/* allocate a nodemap */
+xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
+
+/*
  * DOMAIN DEBUGGING FUNCTIONS
  */
 
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -486,11 +486,27 @@ int libxl_cpumap_alloc(libxl_ctx *ctx, l
     return libxl_map_alloc(ctx, cpumap, max_cpus);
 }
 
+int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap)
+{
+    int max_nodes;
+
+    max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes == 0)
+        return ERROR_FAIL;
+
+    return libxl_map_alloc(ctx, nodemap, max_nodes);
+}
+
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
     return xc_get_max_cpus(ctx->xch);
 }
 
+int libxl_get_max_nodes(libxl_ctx *ctx)
+{
+    return xc_get_max_nodes(ctx->xch);
+}
+
 int libxl__enum_from_string(const libxl_enum_string_table *t,
                             const char *s, int *e)
 {
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -109,6 +109,35 @@ static inline int libxl_cpumap_cpu_valid
 #define libxl_for_each_set_cpu(v, m) for (v = 0; v < (m).size * 8; v++) \
                                              if (libxl_cpumap_test(&(m),
v))
 
+int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap);
+static inline int libxl_nodemap_test(libxl_nodemap *nodemap, int node)
+{
+    return libxl_map_test(nodemap, node);
+}
+static inline void libxl_nodemap_set(libxl_nodemap *nodemap, int node)
+{
+    libxl_map_set(nodemap, node);
+}
+static inline void libxl_nodemap_reset(libxl_nodemap *nodemap, int node)
+{
+    libxl_map_reset(nodemap, node);
+}
+static inline void libxl_nodemap_set_any(libxl_nodemap *nodemap)
+{
+    libxl_map_set_any(nodemap);
+}
+static inline void libxl_nodemap_set_none(libxl_nodemap *nodemap)
+{
+    libxl_map_set_none(nodemap);
+}
+static inline int libxl_nodemap_node_valid(libxl_nodemap *nodemap, int node)
+{
+    return libxl_map_elem_valid(nodemap, node);
+}
+#define libxl_for_each_node(var, map) for (var = 0; var < (map).size * 8;
var++)
+#define libxl_for_each_set_node(v, m) for (v = 0; v < (m).size * 8; v++) \
+                                              if (libxl_nodemap_test(&(m),
v))
+
 static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
     return (s + 1023) / 1024;
 }
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -116,6 +116,14 @@ int xenctl_cpumap_to_cpumask(cpumask_var
     return err;
 }
 
+int xenctl_nodemap_to_nodemask(nodemask_t *nodemask,
+                               const struct xenctl_map *xenctl_nodemap)
+{
+    return xenctl_map_to_bitmap(nodes_addr(*nodemask),
+                                xenctl_nodemap,
+                                MAX_NUMNODES);
+}
+
 static inline int is_free_domid(domid_t dom)
 {
     struct domain *d;

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 04 of 10 [RFC]] libxl: Introduce libxl_get_numainfo() calling xc_numainfo()

xc_numainfo is already there, and it provides information about
things like free memory on each NUMA node and `distances` among
the various nodes. In preparation of putting an automatic
domain-on-nodes placement logic --which will benefit a lot from
having such information available-- make it possible to retrieve
them within libxl by a new libxl_get_numainfo() API call (very
similar to libxl_get_topologyinfo).

TODO: * Enable exporting node distances as soon as we figure out
        how to stick an array in an IDL defined type (patches
        have been posted, will rebase onto them if it''ll turn
        out to be the case.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -2830,6 +2830,83 @@ int libxl_get_physinfo(libxl_ctx *ctx, l
     return 0;
 }
 
+libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr)
+{
+    xc_numainfo_t ninfo;
+    DECLARE_HYPERCALL_BUFFER(xc_node_to_memsize_t, memsize);
+    DECLARE_HYPERCALL_BUFFER(xc_node_to_memfree_t, memfree);
+    DECLARE_HYPERCALL_BUFFER(uint32_t, node_distances);
+    libxl_numainfo *ret = NULL;
+    int i, max_nodes;
+
+    max_nodes = libxl_get_max_nodes(ctx);
+    if (max_nodes == 0)
+    {
+        LIBXL__LOG(ctx, XTL_ERROR, "Unable to determine number of
NODES");
+        return NULL;
+    }
+
+    memsize = xc_hypercall_buffer_alloc
+        (ctx->xch, memsize, sizeof(*memsize) * max_nodes);
+    memfree = xc_hypercall_buffer_alloc
+        (ctx->xch, memfree, sizeof(*memfree) * max_nodes);
+    node_distances = xc_hypercall_buffer_alloc
+        (ctx->xch, node_distances, sizeof(*node_distances) * max_nodes *
max_nodes);
+    if ((memsize == NULL) || (memfree == NULL) || (node_distances == NULL)) {
+        LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
+                            "Unable to allocate hypercall
arguments");
+        goto fail;
+    }
+
+    set_xen_guest_handle(ninfo.node_to_memsize, memsize);
+    set_xen_guest_handle(ninfo.node_to_memfree, memfree);
+    set_xen_guest_handle(ninfo.node_to_node_distance, node_distances);
+    ninfo.max_node_index = max_nodes - 1;
+    if (xc_numainfo(ctx->xch, &ninfo) != 0) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting numainfo");
+        goto fail;
+    }
+
+    ret = malloc(sizeof(libxl_numainfo) * max_nodes);
+    if (ret == NULL) {
+        LIBXL__LOG_ERRNOVAL(ctx, XTL_ERROR, ENOMEM,
+                            "Unable to allocate return value");
+        goto fail;
+    }
+    /* XXX Neglect node distances for now, as we need some sort of array
+     *     type support in the IDL. RFC patches for that have been posted,
+     *     will rebase on top of those for next round if it is the case.
+     *
+     * for (i = 0; i < max_nodes; i++) {
+     *     ret[i].distances = malloc(sizeof(*node_distances) * max_nodes);
+     *     if (ret == NULL) {
+     *         for (int j = i; j >=0; j--)
+     *             free(ret[i].distances);
+     *         free(ret);
+     *         goto fail;
+     *     }
+     * }
+     */
+
+    for (i = 0; i < max_nodes; i++) {
+        ret[i].size = memsize[i];
+        ret[i].free = memfree[i];
+        /* for (int j = 0; j < max_nodes; j++)
+         *    ret[i].distances[j] +         *       
node_distances[i*(max_nodes+1) + j];
+         */
+    }
+
+fail:
+    xc_hypercall_buffer_free(ctx->xch, memsize);
+    xc_hypercall_buffer_free(ctx->xch, memfree);
+    xc_hypercall_buffer_free(ctx->xch, node_distances);
+
+    if (ret)
+        *nr = max_nodes;
+    return ret;
+}
+
 libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr)
 {
     xc_topologyinfo_t tinfo;
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -716,6 +716,8 @@ int libxl_userdata_retrieve(libxl_ctx *c
    */
 
 int libxl_get_physinfo(libxl_ctx *ctx, libxl_physinfo *physinfo);
+libxl_numainfo *libxl_get_numainfo(libxl_ctx *ctx, int *nr);
+void libxl_numainfo_list_free(libxl_numainfo *, int nr);
 #define LIBXL_CPUTOPOLOGY_INVALID_ENTRY (~(uint32_t)0)
 libxl_cputopology *libxl_get_cpu_topology(libxl_ctx *ctx, int *nr);
 void libxl_cputopology_list_free(libxl_cputopology *, int nr);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -411,6 +411,12 @@ libxl_physinfo = Struct("physinfo", [
     ("cap_hvm_directio", bool),
     ], dir=DIR_OUT)
 
+libxl_numainfo = Struct("numainfo", [
+    ("size", uint64),
+    ("free", uint64),
+    ("distances", uint32),
+    ], dir=DIR_OUT)
+
 libxl_cputopology = Struct("cputopology", [
     ("core", uint32),
     ("socket", uint32),
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -521,6 +521,14 @@ int libxl__enum_from_string(const libxl_
     return ERROR_FAIL;
 }
 
+void libxl_numainfo_list_free(libxl_numainfo *list, int nr)
+{
+    int i;
+    for (i = 0; i < nr; i++)
+        libxl_numainfo_dispose(&list[i]);
+    free(list);
+}
+
 void libxl_cputopology_list_free(libxl_cputopology *list, int nr)
 {
     int i;

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

Let the user specify the NUMA node affinity of a guest via the
''nodes='' config switch. Explicitly listing the intended target
host
nodes is required as of now.

A valid setting will directly impact the node_affinity mask
of the guest, i.e., it will change the behaviour of the low level
memory allocator. On the other hand, this commit does not affect
by any means how the guest''s vCPUs are scheduled on the host''s
pCPUs.

TODO: * Better investigate and test interactions with cpupools.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -112,6 +112,15 @@ List of which cpus the guest is allowed 
 (all vcpus will run on cpus 0,2,3,5), or `cpus=["2", "3"]`
(all vcpus
 will run on cpus 2 and 3).
 
+=item B<nodes=[ NODE, NODE, ...]>
+
+List of the NUMA nodes of the host the guest should be considered
+`affine` with. Being affine to a (set of) NUMA node(s) mainly means
+the guest''s memory is going to be allocated on those node(s).
+
+A list of nodes should be specified as follows: `nodes=["0",
"3"]`
+for the guest to be affine with nodes 0 and 3.
+
 =item B<memory=MBYTES>
 
 Start the guest with MBYTES megabytes of RAM.
diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
--- a/tools/libxc/xc_domain.c
+++ b/tools/libxc/xc_domain.c
@@ -110,6 +110,44 @@ int xc_domain_shutdown(xc_interface *xch
 }
 
 
+int xc_domain_node_affinity(xc_interface *xch,
+                            uint32_t domid,
+                            xc_nodemap_t node_affinity)
+{
+    DECLARE_DOMCTL;
+    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
+    int ret = -1;
+    int nodesize;
+
+    nodesize = xc_get_nodemap_size(xch);
+    if (!nodesize)
+    {
+        PERROR("Could not get number of nodes");
+        goto out;
+    }
+
+    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
+    if ( local == NULL )
+    {
+        PERROR("Could not allocate memory for domain_node_affinity domctl
hypercall");
+        goto out;
+    }
+
+    domctl.cmd = XEN_DOMCTL_node_affinity;
+    domctl.domain = (domid_t)domid;
+
+    memcpy(local, node_affinity, nodesize);
+    set_xen_guest_handle(domctl.u.node_affinity.nodemap.bitmap, local);
+    domctl.u.node_affinity.nodemap.nr_elems = nodesize * 8;
+
+    ret = do_domctl(xch, &domctl);
+
+    xc_hypercall_buffer_free(xch, local);
+
+ out:
+    return ret;
+}
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
--- a/tools/libxc/xenctrl.h
+++ b/tools/libxc/xenctrl.h
@@ -520,6 +520,19 @@ int xc_watchdog(xc_interface *xch,
 		uint32_t id,
 		uint32_t timeout);
 
+/**
+ * This function explicitly sets the affinity of a domain with the
+ * host NUMA nodes.
+ *
+ * @parm xch a handle to an open hypervisor interface.
+ * @parm domid the domain id in which vcpus are to be created.
+ * @parm node_affinity the map of the affine nodes.
+ * @return 0 on success, -1 on failure.
+ */
+int xc_domain_node_affinity(xc_interface *xch,
+                            uint32_t domind,
+                            xc_nodemap_t node_affinity);
+
 int xc_vcpu_setaffinity(xc_interface *xch,
                         uint32_t domid,
                         int vcpu,
diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
--- a/tools/libxl/gentest.py
+++ b/tools/libxl/gentest.py
@@ -20,7 +20,7 @@ def randomize_case(s):
 def randomize_enum(e):
     return random.choice([v.name for v in e.values])
 
-handcoded = ["libxl_cpumap", "libxl_key_value_list",
+handcoded = ["libxl_cpumap", "libxl_nodemap",
"libxl_key_value_list",
              "libxl_cpuid_policy_list",
"libxl_file_reference",
              "libxl_string_list"]
 
@@ -119,6 +119,19 @@ static void libxl_cpumap_rand_init(libxl
     }
 }
 
+static void libxl_nodemap_rand_init(libxl_nodemap *nodemap)
+{
+    int i;
+    nodemap->size = rand() % 16;
+    nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map));
+    libxl_for_each_node(i, *nodemap) {
+        if (rand() % 2)
+            libxl_nodemap_set(nodemap, i);
+        else
+            libxl_nodemap_reset(nodemap, i);
+    }
+}
+
 static void libxl_key_value_list_rand_init(libxl_key_value_list *pkvl)
 {
     int i, nr_kvp = rand() % 16;
diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
--- a/tools/libxl/libxl.c
+++ b/tools/libxl/libxl.c
@@ -3082,6 +3082,16 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
     return rc;
 }
 
+int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
+                            libxl_nodemap *nodemap)
+{
+    if (xc_domain_node_affinity(ctx->xch, domid, nodemap->map)) {
+        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node
affinity");
+        return ERROR_FAIL;
+    }
+    return 0;
+}
+
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap)
 {
     GC_INIT(ctx);
diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
--- a/tools/libxl/libxl.h
+++ b/tools/libxl/libxl.h
@@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
                            libxl_cpumap *cpumap);
 int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
                                unsigned int max_vcpus, libxl_cpumap *cpumap);
+int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
+                            libxl_nodemap *nodemap);
 int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap *cpumap);
 
 libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -121,6 +121,12 @@ int libxl__domain_build_info_setdefault(
         libxl_cpumap_set_any(&b_info->cpumap);
     }
 
+    if (!b_info->nodemap.size) {
+        if (libxl_nodemap_alloc(CTX, &b_info->nodemap))
+            return ERROR_NOMEM;
+        libxl_nodemap_set_none(&b_info->nodemap);
+    }
+
     if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
         b_info->max_memkb = 32 * 1024;
     if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -67,6 +67,7 @@ int libxl__build_pre(libxl__gc *gc, uint
     char *xs_domid, *con_domid;
     xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
     libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus,
&info->cpumap);
+    libxl_set_node_affinity(ctx, domid, &info->nodemap);
     xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
LIBXL_MAXMEM_CONSTANT);
     if (info->type == LIBXL_DOMAIN_TYPE_PV)
         xc_domain_set_memmap_limit(ctx->xch, domid,
diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
--- a/tools/libxl/libxl_json.c
+++ b/tools/libxl/libxl_json.c
@@ -119,6 +119,26 @@ out:
     return s;
 }
 
+yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand,
+                                       libxl_nodemap *nodemap)
+{
+    yajl_gen_status s;
+    int i;
+
+    s = yajl_gen_array_open(hand);
+    if (s != yajl_gen_status_ok) goto out;
+
+    libxl_for_each_node(i, *nodemap) {
+        if (libxl_nodemap_test(nodemap, i)) {
+            s = yajl_gen_integer(hand, i);
+            if (s != yajl_gen_status_ok) goto out;
+        }
+    }
+    s = yajl_gen_array_close(hand);
+out:
+    return s;
+}
+
 yajl_gen_status libxl_key_value_list_gen_json(yajl_gen hand,
                                               libxl_key_value_list *pkvl)
 {
diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h
--- a/tools/libxl/libxl_json.h
+++ b/tools/libxl/libxl_json.h
@@ -27,6 +27,7 @@ yajl_gen_status libxl_domid_gen_json(yaj
 yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p);
 yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p);
 yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p);
+yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand, libxl_nodemap *p);
 yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand,
                                                  libxl_cpuid_policy_list *p);
 yajl_gen_status libxl_string_list_gen_json(yajl_gen hand, libxl_string_list
*p);
diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -11,6 +11,7 @@ libxl_domid = Builtin("domid", json_fn  libxl_uuid =
Builtin("uuid", passby=PASS_BY_REFERENCE)
 libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE)
 libxl_cpumap = Builtin("cpumap",
dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE)
+libxl_nodemap = Builtin("nodemap",
dispose_fn="libxl_nodemap_dispose", passby=PASS_BY_REFERENCE)
 libxl_cpuid_policy_list = Builtin("cpuid_policy_list",
dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE)
 
 libxl_string_list = Builtin("string_list",
dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE)
@@ -233,6 +234,7 @@ libxl_domain_build_info = Struct("domain
     ("max_vcpus",       integer),
     ("cur_vcpus",       integer),
     ("cpumap",          libxl_cpumap),
+    ("nodemap",         libxl_nodemap),
     ("tsc_mode",        libxl_tsc_mode),
     ("max_memkb",       MemKB),
     ("target_memkb",    MemKB),
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -515,7 +515,7 @@ static void parse_config_data(const char
     const char *buf;
     long l;
     XLU_Config *config;
-    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids;
+    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids;
     int pci_power_mgmt = 0;
     int pci_msitranslate = 1;
     int pci_permissive = 0;
@@ -628,6 +628,39 @@ static void parse_config_data(const char
         free(buf2);
     }
 
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 0)) {
+        int i, n_nodes = 0;
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_nodemap_set_none(&b_info->nodemap);
+        while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
+            i = atoi(buf);
+            if (!libxl_nodemap_node_valid(&b_info->nodemap, i)) {
+                fprintf(stderr, "node %d illegal\n", i);
+                exit(1);
+            }
+            libxl_nodemap_set(&b_info->nodemap, i);
+            n_nodes++;
+        }
+
+        if (n_nodes == 0)
+            fprintf(stderr, "No valid node found: no affinity will be
set\n");
+    }
+    else {
+        /*
+         * XXX What would a sane default be? I think doing nothing
+         *     (i.e., relying on cpu-affinity/cpupool ==> the current
+         *     behavior) should be just fine.
+         *     That would mean we''re saying to the user "if you
want us
+         *     to take care of NUMA, please tell us, maybe just putting
+         *     ''nodes=auto'', but tell us... otherwise we do
as usual".
+         */
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
diff --git a/tools/python/xen/lowlevel/xl/xl.c
b/tools/python/xen/lowlevel/xl/xl.c
--- a/tools/python/xen/lowlevel/xl/xl.c
+++ b/tools/python/xen/lowlevel/xl/xl.c
@@ -243,6 +243,18 @@ int attrib__libxl_cpumap_set(PyObject *v
     return 0;
 }
 
+int attrib__libxl_nodemap_set(PyObject *v, libxl_nodemap *pptr)
+{
+    int i;
+    long node;
+
+    for (i = 0; i < PyList_Size(v); i++) {
+        node = PyInt_AsLong(PyList_GetItem(v, i));
+        libxl_nodemap_set(pptr, node);
+    }
+    return 0;
+}
+
 int attrib__libxl_file_reference_set(PyObject *v, libxl_file_reference *pptr)
 {
     return genwrap__string_set(v, &pptr->path);
diff --git a/xen/common/domain.c b/xen/common/domain.c
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -222,6 +222,7 @@ struct domain *domain_create(
 
     spin_lock_init(&d->node_affinity_lock);
     d->node_affinity = NODE_MASK_ALL;
+    d->has_node_affinity = 0;
 
     spin_lock_init(&d->shutdown_lock);
     d->shutdown_code = -1;
@@ -337,7 +338,7 @@ void domain_update_node_affinity(struct 
 {
     cpumask_var_t cpumask;
     cpumask_var_t online_affinity;
-    const cpumask_t *online;
+    const cpumask_t *online = cpupool_online_cpumask(d->cpupool);
     nodemask_t nodemask = NODE_MASK_NONE;
     struct vcpu *v;
     unsigned int node;
@@ -350,9 +351,22 @@ void domain_update_node_affinity(struct 
         return;
     }
 
-    online = cpupool_online_cpumask(d->cpupool);
+    spin_lock(&d->node_affinity_lock);
 
-    spin_lock(&d->node_affinity_lock);
+    /*
+     * If someone explicitly told us our NUMA affinity, avoid messing
+     * that up. Notice, however, that vcpu affinity is still what
+     * drives all scheduling decisions.
+     *
+     * XXX I''m quite sure this is fine wrt to the various
v->cpu_affinity
+     *     (at least, that was the intended design!). Could it be useful
+     *     to cross-check d->node_affinity against `online`? The basic
+     *     idea here is "Hey, if you shoot yourself in the foot...
You''ve
+     *     shot in the foot!", but, you know...
+     */
+    if ( d->has_node_affinity )
+        goto out;
+
 
     for_each_vcpu ( d, v )
     {
@@ -365,6 +379,8 @@ void domain_update_node_affinity(struct 
             node_set(node, nodemask);
 
     d->node_affinity = nodemask;
+
+out:
     spin_unlock(&d->node_affinity_lock);
 
     free_cpumask_var(online_affinity);
@@ -372,6 +388,31 @@ void domain_update_node_affinity(struct 
 }
 
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
+{
+    spin_lock(&d->node_affinity_lock);
+
+    /*
+     * Being/becoming affine to _no_ nodes is not going to work,
+     * let''s take it as the `reset node affinity` command.
+     */
+    if ( nodes_empty(*affinity) )
+    {
+        d->has_node_affinity = 0;
+        spin_unlock(&d->node_affinity_lock);
+        domain_update_node_affinity(d);
+        return 0;
+    }
+
+    d->has_node_affinity = 1;
+    d->node_affinity = *affinity;
+
+    spin_unlock(&d->node_affinity_lock);
+
+    return 0;
+}
+
+
 struct domain *get_domain_by_id(domid_t dom)
 {
     struct domain *d;
diff --git a/xen/common/domctl.c b/xen/common/domctl.c
--- a/xen/common/domctl.c
+++ b/xen/common/domctl.c
@@ -621,6 +621,27 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
     }
     break;
 
+    case XEN_DOMCTL_node_affinity:
+    {
+        domid_t dom = op->domain;
+        struct domain *d = rcu_lock_domain_by_id(dom);
+        nodemask_t new_affinity;
+
+        ret = -ESRCH;
+        if ( d == NULL )
+            break;
+
+        /* XXX We need an xsm_* for this I guess, right? */
+
+        ret = xenctl_nodemap_to_nodemask(&new_affinity,
+                                         &op->u.node_affinity.nodemap);
+        if ( !ret )
+            ret = domain_set_node_affinity(d, &new_affinity);
+
+        rcu_unlock_domain(d);
+    }
+    break;
+
     case XEN_DOMCTL_setvcpuaffinity:
     case XEN_DOMCTL_getvcpuaffinity:
     {
diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
--- a/xen/common/keyhandler.c
+++ b/xen/common/keyhandler.c
@@ -217,6 +217,14 @@ static void cpuset_print(char *set, int 
     *set++ = ''\0'';
 }
 
+static void nodeset_print(char *set, int size, const nodemask_t *mask)
+{
+    *set++ = ''['';
+    set += nodelist_scnprintf(set, size-2, mask);
+    *set++ = '']'';
+    *set++ = ''\0'';
+}
+
 static void periodic_timer_print(char *str, int size, uint64_t period)
 {
     if ( period == 0 )
@@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
 
         dump_pageframe_info(d);
                
+        nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity);
+        printk("NODE affinity for domain %d: %s\n", d->domain_id,
tmpstr);
+
         printk("VCPU information and callbacks for domain %u:\n",
                d->domain_id);
         for_each_vcpu ( d, v )
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -278,6 +278,15 @@ typedef struct xen_domctl_getvcpuinfo xe
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
 
 
+/* Set the NUMA node(s) with which the guest is `affine`. */
+/* XEN_DOMCTL_node_affinity */
+struct xen_domctl_node_affinity {
+    struct xenctl_map nodemap;   /* IN */
+};
+typedef struct xen_domctl_node_affinity xen_domctl_node_affinity_t;
+DEFINE_XEN_GUEST_HANDLE(xen_domctl_node_affinity_t);
+
+
 /* Get/set which physical cpus a vcpu can execute on. */
 /* XEN_DOMCTL_setvcpuaffinity */
 /* XEN_DOMCTL_getvcpuaffinity */
@@ -914,6 +923,7 @@ struct xen_domctl {
 #define XEN_DOMCTL_set_access_required           64
 #define XEN_DOMCTL_audit_p2m                     65
 #define XEN_DOMCTL_set_virq_handler              66
+#define XEN_DOMCTL_node_affinity                 67
 #define XEN_DOMCTL_gdbsx_guestmemio            1000
 #define XEN_DOMCTL_gdbsx_pausevcpu             1001
 #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
@@ -927,6 +937,7 @@ struct xen_domctl {
         struct xen_domctl_getpageframeinfo  getpageframeinfo;
         struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
         struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
+        struct xen_domctl_node_affinity     node_affinity;
         struct xen_domctl_vcpuaffinity      vcpuaffinity;
         struct xen_domctl_shadow_op         shadow_op;
         struct xen_domctl_max_mem           max_mem;
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -8,8 +8,9 @@
  * See detailed comments in the file linux/bitmap.h describing the
  * data type on which these nodemasks are based.
  *
- * For details of nodemask_scnprintf() and nodemask_parse(),
- * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
+ * For details of nodemask_scnprintf(), nodelist_scnpintf() and
+ * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
+ * in lib/bitmap.c.
  *
  * The available nodemask operations are:
  *
@@ -48,6 +49,7 @@
  * unsigned long *nodes_addr(mask)	Array of unsigned long''s in mask
  *
  * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
+ * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for
printing
  * int nodemask_parse(ubuf, ulen, mask)	Parse ascii string as nodemask
  *
  * for_each_node_mask(node, mask)	for-loop node over mask
@@ -280,6 +282,14 @@ static inline int __first_unset_node(con
 
 #define nodes_addr(src) ((src).bits)
 
+#define nodelist_scnprintf(buf, len, src) \
+			__nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES)
+static inline int __nodelist_scnprintf(char *buf, int len,
+					const nodemask_t *srcp, int nbits)
+{
+	return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
+}
+
 #if 0
 #define nodemask_scnprintf(buf, len, src) \
 			__nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES)
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -346,8 +346,12 @@ struct domain
     /* Various mem_events */
     struct mem_event_per_domain *mem_event;
 
-    /* Currently computed from union of all vcpu cpu-affinity masks. */
+    /*
+     * Can be specified by the user. If that is not the case, it is
+     * computed from the union of all the vcpu cpu-affinity masks.
+     */
     nodemask_t node_affinity;
+    int has_node_affinity;
     unsigned int last_alloc_node;
     spinlock_t node_affinity_lock;
 };
@@ -416,6 +420,7 @@ static inline void get_knownalive_domain
     ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED));
 }
 
+int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity);
 void domain_update_node_affinity(struct domain *d);
 
 struct domain *domain_create(

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line

For feature parity with vcpu affinity, allow for specifying
node affinity not only at domain creation time, but at run-time
too.

Of course this is not going to be equally effective, as it will
only affect future memory allocations without touching what''s
already there. However, in future we might want to change this,
and use this as an interface for sort-of manual "domain node
migration".

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
--- a/docs/man/xl.pod.1
+++ b/docs/man/xl.pod.1
@@ -556,6 +556,26 @@ different run state is appropriate.  Pin
 this, by ensuring certain VCPUs can only run on certain physical
 CPUs.
 
+=item B<node-affinity> I<domain-id> I<nodes>
+
+Set the NUMA node affinity for the domain, i.e., the set of NUMA
+nodes of the host from which the memory of the domain will be
+allocated. More specificallly, the domain''s memory will be split
+in equal (well, as equal as possible) parts among all the nodes
+it is affine with, The keyword B<all> can be used to have the
+domain affine to all NUMA nodes in the host.
+
+Normally NUMA node affinity of a domain is automatically computed
+from its VCPU affinity. The default behaviour is to have it equal
+to all the nodes the PCPUs onto which the VCPUs of the domain are
+pinned belong to. Manually specifying it can be used to restrict
+this to a specific subset of the host NUMA nodes, for improved
+locality of memory accesses by the domain. Notice, however, that
+this will not affect the memory that has already been allocated.
+For having the full amount of memory allocated on specific node(s)
+at domain creation time, the domain''s configuration file is what
+should be used.
+
 =item B<vncviewer> [I<OPTIONS>] I<domain-id>
 
 Attach to domain''s VNC server, forking a vncviewer process.
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -442,7 +442,7 @@ void libxl_map_dispose(struct libxl_map 
     free(map->map);
 }
 
-static int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems)
+int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems)
 {
     int sz;
 
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -64,6 +64,7 @@ int libxl_devid_to_device_nic(libxl_ctx 
 int libxl_vdev_to_device_disk(libxl_ctx *ctx, uint32_t domid, const char *vdev,
                                libxl_device_disk *disk);
 
+int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems);
 int libxl_map_test(struct libxl_map *map, int elem);
 void libxl_map_set(struct libxl_map *map, int elem);
 void libxl_map_reset(struct libxl_map *map, int elem);
@@ -79,6 +80,10 @@ static inline int libxl_map_elem_valid(s
 {
     return elem >= 0 && elem < (map->size * 8);
 }
+#define libxl_for_each_elem(v, m) for (v = 0; v < (m).size * 8; v++)
+#define libxl_for_each_set_elem(v, m) for (v = 0; v < (m).size * 8; v++) \
+                                              if (libxl_map_test(&(m), v))
+
 
 int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap);
 static inline int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu)
diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
--- a/tools/libxl/xl.h
+++ b/tools/libxl/xl.h
@@ -54,6 +54,7 @@ int main_config_update(int argc, char **
 int main_button_press(int argc, char **argv);
 int main_vcpupin(int argc, char **argv);
 int main_vcpuset(int argc, char **argv);
+int main_nodeaffinity(int argc, char **argv);
 int main_memmax(int argc, char **argv);
 int main_memset(int argc, char **argv);
 int main_sched_credit(int argc, char **argv);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -448,65 +448,75 @@ static void split_string_into_string_lis
     free(s);
 }
 
-static int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
+static int affinity_parse(char *str, struct libxl_map *map, int n_elems)
 {
-    libxl_cpumap exclude_cpumap;
-    uint32_t cpuida, cpuidb;
+    struct libxl_map exclude_map;
+    uint32_t stra, strb;
     char *endptr, *toka, *tokb, *saveptr = NULL;
-    int i, rc = 0, rmcpu;
-
-    if (!strcmp(cpu, "all")) {
-        libxl_cpumap_set_any(cpumap);
+    int i, rc = 0, rmelem;
+
+    if (!strcmp(str, "all")) {
+        libxl_map_set_any(map);
         return 0;
     }
 
-    if (libxl_cpumap_alloc(ctx, &exclude_cpumap)) {
-        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
+    if (libxl_map_alloc(ctx, &exclude_map, n_elems)) {
+        fprintf(stderr, "Error: Failed to allocate libxl_map.\n");
         return ENOMEM;
     }
 
-    for (toka = strtok_r(cpu, ",", &saveptr); toka;
+    for (toka = strtok_r(str, ",", &saveptr); toka;
          toka = strtok_r(NULL, ",", &saveptr)) {
-        rmcpu = 0;
+        rmelem = 0;
         if (*toka == ''^'') {
             /* This (These) Cpu(s) will be removed from the map */
             toka++;
-            rmcpu = 1;
+            rmelem = 1;
         }
         /* Extract a valid (range of) cpu(s) */
-        cpuida = cpuidb = strtoul(toka, &endptr, 10);
+        stra = strb = strtoul(toka, &endptr, 10);
         if (endptr == toka) {
             fprintf(stderr, "Error: Invalid argument.\n");
             rc = EINVAL;
-            goto vcpp_out;
+            goto afp_out;
         }
         if (*endptr == ''-'') {
             tokb = endptr + 1;
-            cpuidb = strtoul(tokb, &endptr, 10);
-            if (endptr == tokb || cpuida > cpuidb) {
+            strb = strtoul(tokb, &endptr, 10);
+            if (endptr == tokb || stra > strb) {
                 fprintf(stderr, "Error: Invalid argument.\n");
                 rc = EINVAL;
-                goto vcpp_out;
+                goto afp_out;
             }
         }
-        while (cpuida <= cpuidb) {
-            rmcpu == 0 ? libxl_cpumap_set(cpumap, cpuida) :
-                         libxl_cpumap_set(&exclude_cpumap, cpuida);
-            cpuida++;
+        while (stra <= strb) {
+            rmelem == 0 ? libxl_map_set(map, stra) :
+                          libxl_map_set(&exclude_map, stra);
+            stra++;
         }
     }
 
     /* Clear all the cpus from the removal list */
-    libxl_for_each_set_cpu(i, exclude_cpumap) {
-        libxl_cpumap_reset(cpumap, i);
-    }
-
-vcpp_out:
-    libxl_cpumap_dispose(&exclude_cpumap);
+    libxl_for_each_set_elem(i, exclude_map) {
+        libxl_map_reset(map, i);
+    }
+
+afp_out:
+    libxl_map_dispose(&exclude_map);
 
     return rc;
 }
 
+static inline int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
+{
+    return affinity_parse(cpu, cpumap, libxl_get_max_cpus(ctx));
+}
+
+static inline int nodeaffinity_parse(char *nodes, libxl_nodemap *nodemap)
+{
+    return affinity_parse(nodes, nodemap, libxl_get_max_nodes(ctx));
+}
+
 static void parse_config_data(const char *configfile_filename_report,
                               const char *configfile_data,
                               int configfile_len,
@@ -3873,6 +3883,40 @@ int main_vcpuset(int argc, char **argv)
     return 0;
 }
 
+static void nodeaffinity(const char *d, char *nodes)
+{
+    libxl_nodemap nodemap;
+
+    find_domain(d);
+
+    if (libxl_nodemap_alloc(ctx, &nodemap))
+        goto nodeaffinity_out;
+
+    if (!strcmp(nodes, "all"))
+        libxl_nodemap_set_any(&nodemap);
+    else if (nodeaffinity_parse(nodes, &nodemap))
+        goto nodeaffinity_out1;
+
+    if (libxl_set_node_affinity(ctx, domid, &nodemap) == -1)
+        fprintf(stderr, "Could not set node affinity for dom
`%d''.\n", domid);
+
+  nodeaffinity_out1:
+    libxl_nodemap_dispose(&nodemap);
+  nodeaffinity_out:
+    ;
+}
+
+int main_nodeaffinity(int argc, char **argv)
+{
+    int opt;
+
+    if ((opt = def_getopt(argc, argv, "", "node-affinity",
2)) != -1)
+        return opt;
+
+    nodeaffinity(argv[optind], argv[optind+1]);
+    return 0;
+}
+
 static void output_xeninfo(void)
 {
     const libxl_version_info *info;
diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
--- a/tools/libxl/xl_cmdtable.c
+++ b/tools/libxl/xl_cmdtable.c
@@ -195,6 +195,11 @@ struct cmd_spec cmd_table[] = {
       "Set the number of active VCPUs allowed for the domain",
       "<Domain> <vCPUs>",
     },
+    { "node-affinity",
+      &main_nodeaffinity, 0,
+      "Set the NUMA node affinity for the domain",
+      "<Domain> <Nodes|all>",
+    },
     { "list-vm",
       &main_list_vm, 0,
       "List the VMs,without DOM0",

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`

Basic idea is: cpu-affinity tells where a vcpu can run,
while node-affinity tells where (in terms of on what NUMA nodes,
but that of course imply a set of CPUs) all the vcpus of the
domain should/prefer to run... Problems starts when you have
both of them speaking at the same time!

Respecting vcpu-affinity should remain the primary concern, thus
what we do here is add some two-step logic here and there in the
scheduler (i.e., where vcpu-affinity related decisions are being
undertaken). During the first step, we check the `&&` of vcpu and
node affinity masks, aiming at giving some precedence to the CPUs
that are both suitable and preferrable for the domain. However,
if that fails in finding a valid CPU, the node-affinity is just
ignored and, in the second step, we just fallback to vcpu-affinity,
as the original behaviour was.

Both the introduced overhead and the benefits this provides has
to be investigated and compared against each other, possibly in
varying conditions and running different workloads.

Finally, although still at the RFC level, efforts have been made
to write the code in a flexible enough fashion so that, if we
ever want to introduce further balancing steps, it shouldn''t be
too much of a pain.

TODO: * Better investigate and test interactions with cpupools.
      * Test, verify and benchmark. Then test, verify and benchmark
        again, and again and again. What I know as of know is that
        it does not explode that easily, but whether it works properly
        and, more important, brings any benefit, this has to be
        proven (and yes, I''m out running these tests and benchs
already,
        but do not esitate to manifest your will for helping with
        that :-D).
      * Add some counter/stats, e.g., to serve the purpose outlined
        right above.

XXX I''m benchmarking this just right now, and peeking at the results
    they don''t seem too bad. Numbers are mostly close to the case where
    the VM is already pinned to a node when created. I''ll post the
    results as soon as I can, and I''ll be happy to investigate any
issue,
    if you feel like the approach could be the right one.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -117,6 +117,8 @@ will run on cpus 2 and 3).
 List of the NUMA nodes of the host the guest should be considered
 `affine` with. Being affine to a (set of) NUMA node(s) mainly means
 the guest''s memory is going to be allocated on those node(s).
+Also, the scheduler will prefer (but not force) running the guest''s
+vcpus on the cpus of the affine node(s).
 
 A list of nodes should be specified as follows: `nodes=["0",
"3"]`
 for the guest to be affine with nodes 0 and 3.
diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c
--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -101,6 +101,15 @@
 
 
 /*
+ * Node Balancing
+ */
+#define CSCHED_BALANCE_NODE_AFFINITY    1
+#define CSCHED_BALANCE_CPU_AFFINITY     0
+#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
+#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
+
+
+/*
  * Boot parameters
  */
 static bool_t __read_mostly sched_credit_default_yield;
@@ -150,6 +159,17 @@ struct csched_dom {
     struct list_head active_vcpu;
     struct list_head active_sdom_elem;
     struct domain *dom;
+    /*
+     * Let''s cache domain''s dom->node_affinity here as an
+     * optimization for a couple of hot paths. In fact,
+     * knowing  whether or not dom->node_affinity has changed
+     * would allow us to avoid rebuilding node_affinity_cpumask
+     * (below) duing node balancing and/or scheduling.
+     */
+    nodemask_t node_affinity_cache;
+    /* Basing on what dom->node_affinity says,
+     * on what CPUs would we like to run most? */
+    cpumask_t node_affinity_cpumask;
     uint16_t active_vcpu_count;
     uint16_t weight;
     uint16_t cap;
@@ -230,6 +250,87 @@ static inline void
     list_del_init(&svc->runq_elem);
 }
 
+#define csched_balance(__step) \
+    for ( (__step) = CSCHED_BALANCE_START_STEP; \
+          (__step) >= CSCHED_BALANCE_END_STEP; \
+          (__step)-- )
+
+/*
+ * Sort-of conversion between node-affinity and vcpu-affinity for the domain,
+ * i.e., a cpumask containing all the cpus from all the set nodes in the
+ * node-affinity mask of the domain.
+ *
+ * Notice that we completely forget about serializing this (both here and
+ * in the various sites where node_affinity_cpumask is used) against
+ * d->node_affinity_lock. That would be hard to do properly, as that lock
+ * is a non IRQ-safe one, and it would be nearly impossible to access it
+ * from the scheduling code. However, although this leaves some room for
+ * races with code paths that updates d->node_affinity, it all should still
+ * be fine, considering:
+ *  (1) d->node_affinity updates are going to be quite rare;
+ *  (2) this balancing logic is kind-of "best effort" anyway;
+ *  (3) given (1), any inconsistencies are likely to be fixed by the next
+ *      call to this same function without risking going into instability.
+ *
+ * XXX Validate (via testing/benchmarking) whether this is true, as it
+ *     likely sounds to be, or it causes unexpected issues.
+ */
+static void
+csched_build_balance_cpumask(struct csched_dom *sdom)
+{
+    struct domain *d = sdom->dom;
+    int node;
+
+    /* Check whether we need to refresh the node-affinity cache we keep */
+    if ( likely(nodes_equal(d->node_affinity, sdom->node_affinity_cache))
)
+        return;
+    else
+        nodes_copy(sdom->node_affinity_cache,
sdom->dom->node_affinity);
+
+    /* Create the cpumask matching the current node-affinity mask */
+    cpumask_clear(&sdom->node_affinity_cpumask);
+    for_each_node_mask( node, sdom->node_affinity_cache )
+        cpumask_or(&sdom->node_affinity_cpumask,
&sdom->node_affinity_cpumask,
+                   &node_to_cpumask(node));
+}
+
+/* Put together the cpumask we are going to use for this balancing step */
+static int
+csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
+{
+    struct domain *d = vc->domain;
+    struct csched_dom *sdom = CSCHED_DOM(d);
+
+    /*
+     * Only two possible steps exists for now: node and vcpu affinity.
+     * If this domain does not have a node-affinity or is affine to all the
+     * nodes, just return th vcpu-affinity mask (for *this* vcpu) and
+     * stop the balancing process.
+     */
+    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
+         step == CSCHED_BALANCE_CPU_AFFINITY )
+    {
+        cpumask_copy(mask, vc->cpu_affinity);
+        return CSCHED_BALANCE_CPU_AFFINITY;
+    }
+
+    csched_build_balance_cpumask(sdom);
+
+    /*
+     * XXX As of now (with only two possible steps, this is easy and readable
+     *     enough. If more steps become necessary at some point in time, we
+     *     can keep an array of cpumask_t in dom_data and return the proper
+     *     element via step-indexing such an array. Of course, we can turn
+     *     this like that even right now... Thoughts?
+     */
+    if ( step == CSCHED_BALANCE_NODE_AFFINITY )
+        cpumask_and(mask, &sdom->node_affinity_cpumask,
vc->cpu_affinity);
+    else
+        cpumask_copy(mask, vc->cpu_affinity);
+
+    return step;
+}
+
 static void burn_credits(struct csched_vcpu *svc, s_time_t now)
 {
     s_time_t delta;
@@ -252,6 +353,20 @@ boolean_param("tickle_one_idle_cpu", opt
 DEFINE_PER_CPU(unsigned int, last_tickle_cpu);
 
 static inline void
+__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
+{
+    CSCHED_STAT_CRANK(tickle_idlers_some);
+    if ( opt_tickle_one_idle )
+    {
+        this_cpu(last_tickle_cpu) = 
+            cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
+        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
+    }
+    else
+        cpumask_or(mask, mask, idle_mask);
+}
+
+static inline void
 __runq_tickle(unsigned int cpu, struct csched_vcpu *new)
 {
     struct csched_vcpu * const cur @@ -289,22 +404,27 @@ static inline void
         }
         else
         {
-            cpumask_t idle_mask;
+            cpumask_t idle_mask, balance_mask;
+            int balance_step;
 
-            cpumask_and(&idle_mask, prv->idlers,
new->vcpu->cpu_affinity);
-            if ( !cpumask_empty(&idle_mask) )
+            csched_balance(balance_step)
             {
-                CSCHED_STAT_CRANK(tickle_idlers_some);
-                if ( opt_tickle_one_idle )
-                {
-                    this_cpu(last_tickle_cpu) = 
-                        cpumask_cycle(this_cpu(last_tickle_cpu),
&idle_mask);
-                    cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask);
-                }
-                else
-                    cpumask_or(&mask, &mask, &idle_mask);
+                /*
+                 * A each balancing step, look for any idler in the poper
+                 * affinity mask for the step itself (for new->vcpu).
+                 */
+                csched_balance_cpumask(new->vcpu, balance_step,
&balance_mask);
+                cpumask_and(&idle_mask, prv->idlers, &balance_mask);
+
+                if ( !cpumask_empty(&idle_mask) )
+                    __cpumask_tickle(&mask, &idle_mask);
+
+                cpumask_and(&mask, &mask, &balance_mask);
+
+                /* We can quit balancing if we found someone to tickle */
+                if ( !cpumask_empty(&mask) )
+                    break;
             }
-            cpumask_and(&mask, &mask, new->vcpu->cpu_affinity);
         }
     }
 
@@ -445,7 +565,7 @@ static inline int
 }
 
 static inline int
-__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu)
+__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask)
 {
     /*
      * Don''t pick up work that''s in the peer''s
scheduling tail or hot on
@@ -453,27 +573,37 @@ static inline int
      */
     return !vc->is_running &&
            !__csched_vcpu_is_cache_hot(vc) &&
-           cpumask_test_cpu(dest_cpu, vc->cpu_affinity);
+           cpumask_test_cpu(dest_cpu, mask);
 }
 
 static int
 _csched_cpu_pick(const struct scheduler *ops, struct vcpu *vc, bool_t commit)
 {
-    cpumask_t cpus;
+    cpumask_t cpus, start_cpus;
     cpumask_t idlers;
     cpumask_t *online;
+    struct csched_dom *sdom = CSCHED_DOM(vc->domain);
     struct csched_pcpu *spc = NULL;
     int cpu;
 
+    csched_build_balance_cpumask(sdom);
+
     /*
      * Pick from online CPUs in VCPU''s affinity mask, giving a
      * preference to its current processor if it''s in there.
+     *
+     * It''s worth tying to take balancing into account,
+     * at least for selecting from which cpu to start looking
+     * around from. XXX ... Or is this going to be too much?
      */
     online = cpupool_scheduler_cpumask(vc->domain->cpupool);
     cpumask_and(&cpus, online, vc->cpu_affinity);
-    cpu = cpumask_test_cpu(vc->processor, &cpus)
+    cpumask_and(&start_cpus, &cpus,
&sdom->node_affinity_cpumask);
+    if ( unlikely(cpumask_empty(&start_cpus)) )
+        cpumask_copy(&start_cpus, &cpus);
+    cpu = cpumask_test_cpu(vc->processor, &start_cpus)
             ? vc->processor
-            : cpumask_cycle(vc->processor, &cpus);
+            : cpumask_cycle(vc->processor, &start_cpus);
     ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu,
&cpus) );
 
     /*
@@ -877,6 +1007,14 @@ csched_alloc_domdata(const struct schedu
     sdom->active_vcpu_count = 0;
     INIT_LIST_HEAD(&sdom->active_sdom_elem);
     sdom->dom = dom;
+    /*
+     *XXX This would be ''The Right Thing'', but as it is still
too
+     *    early and d->node_affinity has not settled yet, maybe we
+     *    can just init the two masks with something like all-nodes
+     *    and all-cpus and rely on the first balancing call for
+     *    having them updated?
+     */
+    csched_build_balance_cpumask(sdom);
     sdom->weight = CSCHED_DEFAULT_WEIGHT;
     sdom->cap = 0U;
 
@@ -1216,30 +1354,54 @@ csched_runq_steal(int peer_cpu, int cpu,
      */
     if ( peer_pcpu != NULL && !is_idle_vcpu(peer_vcpu) )
     {
-        list_for_each( iter, &peer_pcpu->runq )
+        int balance_step;
+
+        /*
+         * Idea here is trying take NUMA affinity (if any) into account.
+         * While at it, try to be general enough, exploiting the balancing
+         * infrastructure as follows:
+         *  - we act in steps (two for now: node and vcpu affinity);
+         *  - at each step we check each item of peer_pcpu''s runq
+         *    looking for a candidate vcpu against the proper CPU
+         *    mask, depending on the step itself;
+         *  - masks are (should be!) checked in order of preference,
+         *    as a match will cause the end of the balancing process;
+         *  - a preferred mask should not contain any
''illegal'' CPU(s)
+         *    (wrt to, e.g., the domain''s
''cpu_affinity'', if any), as
+         *    this might cause the vcpu to be shipped there!
+         */
+        csched_balance(balance_step)
         {
-            speer = __runq_elem(iter);
+            list_for_each( iter, &peer_pcpu->runq )
+            {
+                cpumask_t balance_mask;
 
-            /*
-             * If next available VCPU here is not of strictly higher
-             * priority than ours, this PCPU is useless to us.
-             */
-            if ( speer->pri <= pri )
-                break;
+                speer = __runq_elem(iter);
 
-            /* Is this VCPU is runnable on our PCPU? */
-            vc = speer->vcpu;
-            BUG_ON( is_idle_vcpu(vc) );
+                /*
+                 * If next available VCPU here is not of strictly higher
+                 * priority than ours, this PCPU is useless to us.
+                 */
+                if ( speer->pri <= pri )
+                    break;
 
-            if (__csched_vcpu_is_migrateable(vc, cpu))
-            {
-                /* We got a candidate. Grab it! */
-                CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
-                CSCHED_STAT_CRANK(migrate_queued);
-                WARN_ON(vc->is_urgent);
-                __runq_remove(speer);
-                vc->processor = cpu;
-                return speer;
+                /* Is this VCPU is runnable on our PCPU? */
+                vc = speer->vcpu;
+                BUG_ON( is_idle_vcpu(vc) );
+
+                balance_step = csched_balance_cpumask(vc, balance_step,
+                                                      &balance_mask);
+
+                if (__csched_vcpu_is_migrateable(vc, cpu, &balance_mask))
+                {
+                    /* We got a candidate. Grab it! */
+                    CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
+                    CSCHED_STAT_CRANK(migrate_queued);
+                    WARN_ON(vc->is_urgent);
+                    __runq_remove(speer);
+                    vc->processor = cpu;
+                    return speer;
+                }
             }
         }
     }
diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
--- a/xen/include/xen/nodemask.h
+++ b/xen/include/xen/nodemask.h
@@ -195,6 +195,13 @@ static inline int __nodes_weight(const n
 	return bitmap_weight(srcp->bits, nbits);
 }
 
+#define nodes_copy(dst, src) __nodes_copy(&(dst), &(src), MAX_NUMNODES)
+static inline void __nodes_copy(nodemask_t *dstp,
+				const nodemask_t *srcp, int nbits)
+{
+	bitmap_copy(dstp->bits, srcp->bits, nbits);
+}
+
 #define nodes_shift_right(dst, src, n) \
 			__nodes_shift_right(&(dst), &(src), (n), MAX_NUMNODES)
 static inline void __nodes_shift_right(nodemask_t *dstp,

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Allow the user to ask for automatic placement of a new domain onto
the NUMA nodes of the host. The most important consideration, wrt
to this, definitely is how much free memory we have on the various
nodes vs. how much memory the domain needs, and this is exactly what
is used to drive the algorithm decisions. Some more refinement, e.g.,
adding more factors, like actual load on the CPUs, etc, to the
"equation" can be easily added later.

The placement logic is very simple and it uses the First Fit algorithm.
This basically means the domain is put in the first node (or, perhaps,
set of nodes) with enough free memory. A form of `sanity check'' is
performed after the memory-wise placement, to be sure the domain has
at least as much pCPUs available as its vCPUs, and if that is not the
case, more nodes are added to its affinity map.

The user can ask for full automatic placement, in which case the least
possible number of nodes will be used, or provide some hints, such
as how many nodes he wants the domain to be affine to. He can also ask
for a more specific placement (an explicit list of nodes), and the
algorithm will honour that, if possible, or just fail. Finally, if
the user doesn''t say anything about node affinity, but the domain has
some vcpu affinity, the algorithm will use that information as a
starting point for placing it.

TODO: * As usual, relationship with cpupools should be better
        both defined and (then) checked.
      * This only takes memory into account, forgetting about how
        busy the CPUs of the various nodes are. Heuristics for
        taking that into account too need to be thought about,
        and proper support for gathering the data they need for
        working (e.g., stats about load on pCPUs?) implemented.
      * This just ignores node distances, as exporting such info
        via libxl would be better when we''ll have arrays in the IDL;
        however, spots where such information should be considered
        are clearly identified and marked, so it should be trivial
        to put it in place once we''ll have the proper backup.
      * Checking for free memory in `xl'' and then allocating it
        to domains in Xen is prone to race conditions. Domain
        creation should be fine, given the current --very coarse--
        locking discipline implemented in `xl'' itself, but, for
        instance, nothing protect creation and ballooning from
        running into each other. This affects the present code
        as well (freemem() already checks for free memory in the
        creation process), but can get particularly nasty with
        this changes applied. It needs to be solved somehow...
        Solutions are under consideration.
      * Benchmarks, e.g., something like put a bunch of VM
        on "auto" placement policy and check for some kind of
        aggregate throughput (I''m working on this, in order to
        have it happening by means of the same infrastructure I
        used for producing the results referred to in the #0 mail).

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

add-firstfit-automatic-domain-placement.patch

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -122,6 +122,36 @@ vcpus on the cpus of the affine node(s).
 
 A list of nodes should be specified as follows: `nodes=["0",
"3"]`
 for the guest to be affine with nodes 0 and 3.
+It is also possible to ask for automatic placement of the guest on
+host nodes by either using `nodes="auto"` or just putting the number
+of nodes the guest should span, e.g., `nodes=2`. In the former case,
+xl will try to make the guest affine with the lowest possible number
+of nodes, in the latter it will try to make it affine to the provided
+number of them.
+
+=item B<nodes_policy="NODE-POLICY">
+
+Allows for better specifying how to automatically set the guest NUMA
+affinity. Available C<NODE-POLICY>-ies are:
+
+=over 4
+
+=item B<auto>
+
+use a not better specified (xl-implementation dependant) algorithm
+to try automatically fitting the guest on the host''s NUMA nodes
+
+=item B<ffit>
+
+use the First Fit algorithm to try automatically fitting the
+guest on the host''s NUMA nodes
+
+=back
+
+This option interacts with `nodes=` such that, if the number of
+nodes to use is specified there (e.g., `nodes=2`), xl will use
+the specified policy to try  fitting the guest on that exact
+number of NUMA nodes of the host.
 
 =item B<memory=MBYTES>
 
diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
--- a/tools/libxl/libxl_utils.c
+++ b/tools/libxl/libxl_utils.c
@@ -454,6 +454,15 @@ int libxl_map_alloc(libxl_ctx *ctx, stru
     return 0;
 }
 
+static void libxl_map_copy(/*XXX libxl_ctx *ctx, */ struct libxl_map *dptr,
+                          const struct libxl_map *sptr)
+{
+    int sz;
+
+    sz = dptr->size = sptr->size;
+    memcpy(dptr->map, sptr->map, sz * sizeof(*dptr->map));
+}
+
 int libxl_map_test(struct libxl_map *map, int elem)
 {
     if (elem >= map->size * 8)
@@ -497,6 +506,12 @@ int libxl_nodemap_alloc(libxl_ctx *ctx, 
     return libxl_map_alloc(ctx, nodemap, max_nodes);
 }
 
+void libxl_nodemap_copy(/*XXX libxl_ctx *ctx, */ libxl_nodemap *dst,
+                       const libxl_nodemap *src)
+{
+   libxl_map_copy(/*XXX ctx, */ dst, src);
+}
+
 int libxl_get_max_cpus(libxl_ctx *ctx)
 {
     return xc_get_max_cpus(ctx->xch);
diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
--- a/tools/libxl/libxl_utils.h
+++ b/tools/libxl/libxl_utils.h
@@ -115,6 +115,8 @@ static inline int libxl_cpumap_cpu_valid
                                              if (libxl_cpumap_test(&(m),
v))
 
 int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap);
+void libxl_nodemap_copy(/*XXX libxl_ctx *ctx, */ libxl_nodemap *dst,
+                       const libxl_nodemap *src);
 static inline int libxl_nodemap_test(libxl_nodemap *nodemap, int node)
 {
     return libxl_map_test(nodemap, node);
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -113,6 +113,55 @@ static const char *action_on_shutdown_na
     [LIBXL_ACTION_ON_SHUTDOWN_COREDUMP_RESTART] = "coredump-restart",
 };
 
+/* Automatic placement policies symbols, with the following meaning:
+ *  NONE   : no automatic placement at all;
+ *  STATIC : explicit nodes specification.
+ *  FFIT   : use the First Fit algorithm for automatic placement;
+ *  AUTO   : an not better specified automatic placement, likely
+ *           to be implemented as a short circuit to (one of)
+ *           the above(s);
+ */
+#define NODES_POLICY_NONE    0
+#define NODES_POLICY_STATIC  1
+#define NODES_POLICY_FFIT    2
+#define NODES_POLICY_AUTO    3
+
+/* Default behaviour wrt to automatic domain placement on nodes,
+ * maximum number of attempts made for applying the desired
+ * placement policy and parameters, and of the same for attempts
+ * made for having enough CPUs available to the domain.
+ */
+#define NODES_POLICY_DEFAULT       NODES_POLICY_NONE
+#define NODES_POLICY_FIT_ATTEMPTS  5
+#define NODES_POLICY_VCPU_ATTEMPTS 5
+
+/* Whether, in case it is not possible to fit a domain on
+ * the requested nodes, we should give up (_NO_RETRY), try
+ * to increment (_RETRY_INC), decrement (_RETRY_DEC) or both
+ * (_RETRY_BOTH) the number of nodes.
+ *
+ * XXX Maybe the actual value of _INC and _DEC, as well as
+ *     _DEFAULT and _*_ATTEMPTS above is something we want
+ *     to be configurable, e.g., in xl.conf or alike?
+ */
+#define NODES_POLICY_NO_RETRY   ( 0)
+#define NODES_POLICY_RETRY_INC  (+1)
+#define NODES_POLICY_RETRY_DEC  (-1)
+#define NODES_POLICY_RETRY_BOTH (INT_MAX)
+
+static const char *nodes_policy_names[] = {
+    [NODES_POLICY_NONE]   = "none",
+    [NODES_POLICY_STATIC] = "static",
+    [NODES_POLICY_FFIT]   = "ffit",
+    [NODES_POLICY_AUTO]   = "auto",
+};
+
+/* Store the policy for the domain while parsing */
+static int nodes_policy = NODES_POLICY_DEFAULT;
+
+/* Store the number of nodes to be used while parsing */
+static int num_nodes_policy = 0;
+
 /* Optional data, in order:
  *   4 bytes uint32_t  config file size
  *   n bytes           config file in Unix text file format
@@ -507,6 +556,314 @@ afp_out:
     return rc;
 }
 
+static int nodes_policy_parse(const char *pol, long int *policy)
+{
+    int i;
+    const char *n;
+
+    for (i = 0; i < sizeof(nodes_policy_names) /
sizeof(nodes_policy_names[0]); i++) {
+        n = nodes_policy_names[i];
+
+        if (!n) continue;
+
+        if (strcmp(pol, n) == 0) {
+            *policy = i;
+            return 0;
+        }
+    }
+
+    return EINVAL;
+}
+
+static inline void __add_nodes_to_nodemap(libxl_nodemap *nodemap,
+                                          const libxl_numainfo *numa,
+                                          int nr_nodes, int howmany)
+{
+    int i;
+
+    /* XXX Once we have the node-distance information, maybe we can
+     *     use it here in trying to set nodes closer to the ones that
+     *     are already in the map. */ 
+    for (i = 0 ; i < nr_nodes && howmany > 0; i++) {
+        if (!libxl_nodemap_test(nodemap, i) && numa[i].size > 0) {
+            libxl_nodemap_set(nodemap, i);
+            howmany--;
+        }
+    }
+}
+
+/*
+ * First Fit automatic placement. Just scan all the nodes in the
+ * provided map (@nodemap) linearly and pick up the fist @nodes
+ * that fit the memory requirements (@memkb) of the domain.
+ * This also returns the actual number of used nodes (still via
+ * @nodes) and the actual nodes map to be used (still via @nodemap),
+ * as they both can change depending on the retry policy (@retry).
+ */
+static int place_domain_ffit(const libxl_numainfo *numa, uint64_t memb,
+                             int retry, int nr_nodes, int *nodes,
+                             libxl_nodemap *nodemap)
+{
+    uint64_t memb_node;
+    libxl_nodemap suitable_nodes;
+    int attempts = 1, rc = 0;
+    int i, found_nodes;
+
+    /* XXX This is one of the nastiest piece of code I''ve
+     *     ever written... Mhuahahahah!! */
+    if (retry == NODES_POLICY_RETRY_BOTH) {
+        int dec_nodes;
+        /* Order (_DEC before than _INC) should stay like this,
+         * as NODES_POLICY_RETRY_INC can broaden the node map. */
+        retry = NODES_POLICY_RETRY_INC;
+        dec_nodes = *nodes;
+        if (!place_domain_ffit(numa, memb, NODES_POLICY_RETRY_DEC,
+                               nr_nodes, &dec_nodes, nodemap)) {
+            *nodes = dec_nodes;
+            return 0;
+        }
+    }
+
+    if (libxl_nodemap_alloc(ctx, &suitable_nodes) < 0) {
+        fprintf(stderr, "libxl_nodemap_alloc failed\n");
+        return ENOMEM;
+    }
+
+    do {
+        rc = ENOSPC;
+
+        /* Avoid trying to look for silly number of nodes */
+        if (*nodes <= 0 || *nodes > nr_nodes)
+            break;
+
+        /* Determine how much free memory we want on each of the nodes
+         * that will end up in suitable_nodes. Either adding or ignoring
+         * the modulus of the integer division should be fine (the total
+         * number of nodes should be in the order of tens of them), so
+         * let''s add it as it should be more safe.
+         */
+        memb_node = (memb / (*nodes)) + (memb % (*nodes));
+
+        /* Let''s see if there are enough (valid!) nodes in the map
+         * with such an amount of memory. */
+        found_nodes = 0;
+        libxl_nodemap_set_none(&suitable_nodes);
+        libxl_for_each_set_node(i, *nodemap) {
+            if (numa[i].size > 0 && numa[i].free >= memb_node) {
+                libxl_nodemap_set(&suitable_nodes, i);
+                if (++found_nodes >= *nodes)
+                    break;
+            }
+        }
+        if (found_nodes == (*nodes)) {
+            /* We''re done, let''s commit the resulting nodemap
*/
+            libxl_nodemap_copy(nodemap, &suitable_nodes); 
+            rc = 0;
+            break;
+        }
+
+        /* Apply the asked retry policy for the next round. Notice
+         * that it would be pointless to increase nodes without also
+         * adding some nodes to the map! */
+        *nodes += retry;
+        if (retry == NODES_POLICY_RETRY_INC)
+            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);
+
+        attempts++;
+    } while (retry != NODES_POLICY_NO_RETRY &&
+             attempts < NODES_POLICY_FIT_ATTEMPTS);
+
+    libxl_nodemap_dispose(&suitable_nodes);
+
+    return rc;
+}
+
+/* Try to achieve optimal node-affinity for the domain */
+static int place_domain(libxl_domain_build_info *b_info)
+{
+    uint32_t need_memkb;
+    libxl_numainfo *numa;
+    libxl_cputopology *info;
+    libxl_nodemap new_nodemap;
+    int nr_nodes, nr_cpus;
+    int use_nodes = 0, use_cpus;
+    int attempts = 1, rc = 0;
+    int i, retry_policy;
+
+    if (nodes_policy == NODES_POLICY_NONE ||
+        nodes_policy == NODES_POLICY_STATIC) 
+        /* Nothing to be done: if we''re here no policy is being set,
either
+         * because domain doesn''t want one, or it asked for specific
nodes. */
+        return 0;
+
+    rc = libxl_domain_need_memory(ctx, b_info, &need_memkb);
+    if (rc < 0) {
+        fprintf(stderr, "libxl_domain_need_memory failed.\n");
+        goto out;
+    }
+
+    /* We need to know both how much free memory we have on the
+     * nodes, and to what node the various CPUS belong. */
+    numa = libxl_get_numainfo(ctx, &nr_nodes);
+    if (numa == NULL) {
+        fprintf(stderr, "libxl_get_numainfo failed.\n");
+        rc = ENOMEM;
+        goto out;
+    }
+    info = libxl_get_cpu_topology(ctx, &nr_cpus);
+    if (info == NULL) {
+        fprintf(stderr, "libxl_get_topologyinfo failed.\n");
+        rc = ENOMEM;
+        goto out_numa;
+    }
+
+    /* In order not to alter the actual b_info->nodemap during the
+     * process (what if something goes wrong in the middle?) use
+     * this copy of it and only commit the result at the end. */
+    if (libxl_nodemap_alloc(ctx, &new_nodemap) < 0) {
+        fprintf(stderr, "libxl_nodemap_alloc failed\n");
+        goto out_topology;;
+    }
+
+    /* If a nodemap has been specified, just comply with that.
+     * OTOH, if there''s no nodemap, rely on cpumap (if any), and
+     * fallback to all node if even that is empty. Also consider
+     * all the nodes to be valid if only the number (i.e., _how_
+     * _many_ of them instead of _which_ of them) of nodes the
+     * domain wants is provided.
+     *
+     * Concering retries, if a specific set of nodes is specified,
+     * just try that and give up if we fail. If, instead, a set of
+     * CPUs onto which to pin the domain is specified, try first
+     * the exact nodes those CPUs belongs to, then also try both
+     * a smaller and a bigger number of nodes. The same happens if
+     * we just have the number of nodes to be used. Finally, if
+     * there is no node-affinity, no cpu-pinning, no number of nodes,
+     * start trying with one node and increase at the configured
+     * rate (considering all the nodes to be suitable).
+     *
+     * XXX some sort of libxl_{cpu,node}map_weight can be
+     *     implemented and used here to both speedup and
+     *     restructure this ugly code below!
+     */
+    if (num_nodes_policy != 0) {
+        /* The num. of nodes is all we have */
+        retry_policy = NODES_POLICY_RETRY_BOTH;
+        libxl_nodemap_set_any(&new_nodemap);
+        use_nodes = num_nodes_policy;
+    } else {
+        /* Is a list of suitable nodes there? */
+        retry_policy = NODES_POLICY_NO_RETRY;
+        libxl_for_each_set_node(i, b_info->nodemap) {
+            libxl_nodemap_set(&new_nodemap, i);
+            use_nodes++;
+        }
+        if (use_nodes == 0) {
+            int mask_cpus = 0;
+
+            /* No list of nodes, maybe the domain is pinned somewhere? */
+            retry_policy = NODES_POLICY_RETRY_BOTH;
+            libxl_for_each_set_cpu(i, b_info->cpumap) {
+                mask_cpus++;
+                if (!libxl_nodemap_test(&new_nodemap, info[i].node)) {
+                    libxl_nodemap_set(&new_nodemap, info[i].node);
+                    use_nodes++;
+                }
+            }
+            /* All cpus just means no affinity, so reset nodes anyway */
+            if (mask_cpus == nr_cpus)
+                use_nodes = 0;
+        }
+        if (use_nodes == 0) {
+            /* No hints about placement at all */
+            retry_policy = NODES_POLICY_RETRY_INC;
+            libxl_nodemap_set_any(&new_nodemap);
+            use_nodes = 1;
+        }
+    }
+    do {
+        rc = ESRCH;
+
+        if (use_nodes > nr_nodes)
+            break;
+
+        /* Kick off the chosen placement policy.
+         *
+         * XXX This is blind wrt to what happens on the CPUs
+         *     of the various nodes. Many solutions (most of which
+         *     mainly constituted by some heuristics) can be found,
+         *     but they all require the hyppervisor to provide some
+         *     information on how loaded and busy they are, and
+         *     that will be material for future work. However, if
+         *     there are any thoughts about that, they are welcome
+         *     here. */
+        switch(nodes_policy) {
+        case NODES_POLICY_AUTO:
+        case NODES_POLICY_FFIT:
+            rc = place_domain_ffit(numa, (uint64_t) need_memkb * 1024,
+                                   retry_policy, nr_nodes, &use_nodes,
+                                   &new_nodemap);
+            break;
+        }
+
+        if (rc)
+            goto out_nodemap;
+
+        /* Check whether the (set of) node(s) selected so far is able
+         * to provide the domain with enough CPUs for his VCPUs. In
+         * case it does not respin the whole procedure with one more
+         * node available for it to use.
+         *
+         * XXX This does not work as expected! Point is there are a lot
+         *     of entries in info (as it is returned by libxl_get_cpu_topology
+         *     that does not represent an actual physical CPU, they''re
+         *     just free space on the array, as it ranges from 0 up to
+         *     max_cpu_id (nr_cpus here), which is the maximum number of
+         *     supported CPUs. For example, on my testbox, info array has
+         *     64 elements, while I only have 16 CPUs. Moreover, all the
+         *     "empty spots" are filled with 0s, including their node
field.
+         *     This means the cycle below will se all of them belonging to
+         *     node #0, which will always have a lot of CPUs (56 here!)
+         *     and will thus always (well, mostly!) be suitable for
+         *     accomodating a guest, which might not be true!
+         *     Of course we can change all this in many ways, I just
+         *     would appreciate some feedback on which of those to
+         *     go for. :-)
+         */
+        use_cpus = 0;
+        libxl_for_each_cpu(i, b_info->cpumap) {
+            /* We only want CPUs belonging to (valid!) nodes in the mask */
+            if (libxl_nodemap_test(&new_nodemap, info[i].node) &&
+                numa[info[i].node].size > 0) {
+                use_cpus++;
+            }
+        }
+
+        if (use_cpus >= b_info->max_vcpus) {
+            rc = 0;
+            break;
+        }
+        /* Add one more node and retry fitting the domain */
+        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
+        use_nodes += 1;
+
+        attempts++;
+    } while (attempts <= NODES_POLICY_VCPU_ATTEMPTS);
+
+    /* Things went fine. Commit the map into domain''s build info */
+    if (!rc)
+        libxl_nodemap_copy(&b_info->nodemap, &new_nodemap);
+
+out_nodemap:
+    libxl_nodemap_dispose(&new_nodemap);
+out_topology:
+    libxl_cputopology_list_free(info, nr_cpus);
+out_numa:
+    libxl_numainfo_list_free(numa, nr_nodes);
+out:
+    return rc;
+}
+
 static inline int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
 {
     return affinity_parse(cpu, cpumap, libxl_get_max_cpus(ctx));
@@ -638,7 +995,7 @@ static void parse_config_data(const char
         free(buf2);
     }
 
-    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 0)) {
+    if (!xlu_cfg_get_list (config, "nodes", &nodes, 0, 1)) {
         int i, n_nodes = 0;
 
         if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
@@ -646,6 +1003,7 @@ static void parse_config_data(const char
             exit(1);
         }
 
+        nodes_policy = NODES_POLICY_STATIC;
         libxl_nodemap_set_none(&b_info->nodemap);
         while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
             i = atoi(buf);
@@ -660,6 +1018,37 @@ static void parse_config_data(const char
         if (n_nodes == 0)
             fprintf(stderr, "No valid node found: no affinity will be
set\n");
     }
+    else if (!xlu_cfg_get_long (config, "nodes", &l, 1)) {
+        if (l <= 0) {
+            fprintf(stderr, "Only strictly positive values for
\"nodes\"\n");
+            exit(1);
+        }
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        libxl_nodemap_set_none(&b_info->nodemap);
+        nodes_policy = NODES_POLICY_AUTO;
+        num_nodes_policy = l;
+    }
+    else if (!xlu_cfg_get_string (config, "nodes", &buf, 1)) {
+        if (strcmp(buf, "auto")) {
+            fprintf(stderr, "ERROR: invalid value \"%s\" for
\"nodes\""
+                    " (only \"auto\" supported here)\n",
buf);
+            exit(1);
+        }
+
+        if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+            fprintf(stderr, "Unable to allocate nodemap\n");
+            exit(1);
+        }
+
+        /* Automatic placement will take care */
+        libxl_nodemap_set_none(&b_info->nodemap);
+        nodes_policy = NODES_POLICY_AUTO;
+    }
     else {
         /*
          * XXX What would a sane default be? I think doing nothing
@@ -671,6 +1060,32 @@ static void parse_config_data(const char
          */
     }
 
+    if (!xlu_cfg_get_string (config, "nodes_policy", &buf, 0)) {
+        fprintf(stderr, "got a nodes policy string:
\"%s\"\n", buf);
+
+        if (nodes_policy_parse(buf, &l)) {
+            fprintf(stderr, "ERROR: invalid value for
\"nodes_policy\"\n");
+            exit(1);
+        }
+        if (l == NODES_POLICY_STATIC || l == NODES_POLICY_NONE) {
+            fprintf(stderr, "ERROR: \"%s\" can''t be
used here\n", buf);
+            exit(1);
+        }
+
+        /* Actually set the policy. If there is no nodemap, build
+         * a new one for the placement to use. Otherwise, don''t touch
it,
+         * i.e., tell the policy to try to respect that and act on
+         * those nodes only (if possible). */
+        nodes_policy = l;
+        if (!b_info->nodemap.size) {
+            if (libxl_nodemap_alloc(ctx, &b_info->nodemap)) {
+                fprintf(stderr, "Unable to allocate nodemap\n");
+                exit(1);
+            }
+            libxl_nodemap_set_none(&b_info->nodemap);
+        }
+    }
+
     if (!xlu_cfg_get_long (config, "memory", &l, 0)) {
         b_info->max_memkb = l * 1024;
         b_info->target_memkb = b_info->max_memkb;
@@ -1703,6 +2118,16 @@ start:
         goto error_out;
     }
 
+    ret = place_domain(&d_config.b_info);
+    if (ret == ESRCH || ret == ENOSPC) {
+        fprintf(stderr, "failed to place the domain, fallback to all
nodes\n");
+        libxl_nodemap_set_any(&d_config.b_info.nodemap);
+    } else if (ret < 0) {
+        fprintf(stderr, "failed to put the domain on the requested
node(s)\n");
+        ret = ERROR_FAIL;
+        goto error_out;
+    }
+
     if ( dom_info->console_autoconnect ) {
         cb = autoconnect_console;
     }else{
diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)
 
 	for_each_online_node(i) {
 		paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn + 1)<<
PAGE_SHIFT;
-		printk("idx%d -> NODE%d start->%lu size->%lu\n",
+		printk("idx%d -> NODE%d start->%lu size->%lu
free->%lu\n",
 			  i, NODE_DATA(i)->node_id,
 			  NODE_DATA(i)->node_start_pfn,
-			  NODE_DATA(i)->node_spanned_pages);
+			  NODE_DATA(i)->node_spanned_pages,
+                          avail_node_heap_pages(i));
 		/* sanity check phys_to_nid() */
 		printk("phys_to_nid(%"PRIpaddr") -> %d should be
%d\n", pa, phys_to_nid(pa),
 			  NODE_DATA(i)->node_id);

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 09 of 10 [RFC]] xl: Introduce Best and Worst Fit guest placement algorithms

Besides than "auto" and "ffit", as per the previous patch,
enable
Best and Worst Fit placement heuristics to `xl`, for the user to
chose them in the config file. Implementation just sits on top
of the First Fit algorithm code, with just the proper reordering
of the nodes and their free memory before invoking it (and after
it finishes).

TODO: * As usual for this series, benchmarking and testing,
        as much thorough and comprehensive as possible!

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -146,6 +146,16 @@ to try automatically fitting the guest o
 use the First Fit algorithm to try automatically fitting the
 guest on the host''s NUMA nodes
 
+=item B<bfit>
+
+use the Best Fit algorithm to try automatically fitting the
+guest on the host''s NUMA nodes
+
+=item B<wfit>
+
+use the Worst Fit algorithm to try automatically fitting the
+guest on the host''s NUMA nodes
+
 =back
 
 This option interacts with `nodes=` such that, if the number of
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -117,6 +117,8 @@ static const char *action_on_shutdown_na
  *  NONE   : no automatic placement at all;
  *  STATIC : explicit nodes specification.
  *  FFIT   : use the First Fit algorithm for automatic placement;
+ *  BFIT   : use the Best Fit algorithm for automatic placement;
+ *  WFIT   : use the Worst Fit algorithm for automatic placement;
  *  AUTO   : an not better specified automatic placement, likely
  *           to be implemented as a short circuit to (one of)
  *           the above(s);
@@ -124,7 +126,9 @@ static const char *action_on_shutdown_na
 #define NODES_POLICY_NONE    0
 #define NODES_POLICY_STATIC  1
 #define NODES_POLICY_FFIT    2
-#define NODES_POLICY_AUTO    3
+#define NODES_POLICY_BFIT    3
+#define NODES_POLICY_WFIT    4
+#define NODES_POLICY_AUTO    5
 
 /* Default behaviour wrt to automatic domain placement on nodes,
  * maximum number of attempts made for applying the desired
@@ -153,6 +157,8 @@ static const char *nodes_policy_names[] 
     [NODES_POLICY_NONE]   = "none",
     [NODES_POLICY_STATIC] = "static",
     [NODES_POLICY_FFIT]   = "ffit",
+    [NODES_POLICY_BFIT]   = "bfit",
+    [NODES_POLICY_WFIT]   = "wfit",
     [NODES_POLICY_AUTO]   = "auto",
 };
 
@@ -595,10 +601,17 @@ static inline void __add_nodes_to_nodema
 /*
  * First Fit automatic placement. Just scan all the nodes in the
  * provided map (@nodemap) linearly and pick up the fist @nodes
- * that fit the memory requirements (@memkb) of the domain.
+ * that fit the memory requirements (@memb) of the domain.
  * This also returns the actual number of used nodes (still via
  * @nodes) and the actual nodes map to be used (still via @nodemap),
  * as they both can change depending on the retry policy (@retry).
+ *
+ * The idea behing First Fit is to be efficient and quick, and it
+ * generally works better than Best Fit wrt fragmentation, although
+ * it tends to occupy "early" nodes more than "late" ones.
+ *
+ * This function is also used as the ground for implementing Best
+ * and Worst Fit placement solutions too.
  */
 static int place_domain_ffit(const libxl_numainfo *numa, uint64_t memb,
                              int retry, int nr_nodes, int *nodes,
@@ -678,6 +691,133 @@ static int place_domain_ffit(const libxl
     return rc;
 }
 
+typedef struct {
+    int idx;
+    uint32_t memfree;
+} numa_reorderer;
+
+typedef int (*reorderer_cmp)(const void *, const void *);
+
+static int reorderer_cmp_asc(const void *r1,
+                             const void *r2)
+{
+    return ((const numa_reorderer*) r1)->memfree -
+        ((const numa_reorderer*) r2)->memfree;
+}
+
+/* Turns the comparison the other way around for descending ordering */
+static int reorderer_cmp_desc(const void *r1,
+                              const void *r2)
+{
+    return ((const numa_reorderer*) r2)->memfree -
+        ((const numa_reorderer*) r1)->memfree;
+}
+
+static int __place_domain_bw_fit(const libxl_numainfo *numa, uint64_t memb,
+                                 int retry, int nr_nodes, int *nodes,
+                                 libxl_nodemap *nodemap, reorderer_cmp cmpr)
+{
+    numa_reorderer *n;
+    libxl_numainfo *numa_ord;
+    libxl_nodemap nodemap_ord;
+    int i, rc;
+
+    n = malloc(sizeof(*n) * nr_nodes);
+    if (!n) {
+        fprintf(stderr, "malloc failed\n");
+        return ENOMEM;
+    }
+    numa_ord = malloc(sizeof(*numa) * nr_nodes);
+    if (!numa_ord) {
+        fprintf(stderr, "malloc failed\n");
+        rc = ENOMEM;
+        goto out_n;
+    }
+    if (libxl_nodemap_alloc(ctx, &nodemap_ord) < 0) {
+        fprintf(stderr, "libxl_nodemap_alloc failed\n");
+        rc = ENOMEM;
+        goto out_numaord;
+    }
+
+    /* Initialize the reordering structure so that we can
+     * ''sort'' the idx basing on the values of memfree, and
+     * thus have the full trace of the permutations applied
+     * by the sorting algorithm. */
+    for (i = 0; i < nr_nodes; i++) {
+        n[i].idx = i;
+        n[i].memfree = numa[i].free;
+   } 
+
+    qsort(n, nr_nodes, sizeof(*n), cmpr);
+
+    /* Apply the permutation to the numainfo array as
+     * well as to the nodemap. */
+    for (i = 0; i < nr_nodes; i++) {
+        numa_ord[i] = numa[n[i].idx];
+        if (libxl_nodemap_test(nodemap, n[i].idx))
+            libxl_nodemap_set(&nodemap_ord, i);
+    }
+
+    /* Now let First Fit do his job on the ordered data */
+    rc = place_domain_ffit(numa_ord, memb, retry, nr_nodes,
+                           nodes, &nodemap_ord);
+
+    /* `Rollback` the permutation of the node map */
+    libxl_nodemap_set_none(nodemap);
+    for (i = 0; i < nr_nodes; i++) {
+        if (libxl_nodemap_test(&nodemap_ord, i))
+            libxl_nodemap_set(nodemap, n[i].idx);
+    }
+
+    libxl_nodemap_dispose(&nodemap_ord);
+out_numaord:
+    free(numa_ord);
+out_n:
+    free(n);
+
+    return rc;
+}
+
+/* Best Fit automatic placement. It will (try to) find the node(s)
+ * with the smallest possible amount of free memory that also satisfies
+ * the domain memory requirements (@memb).
+ *
+ * The idea behing Best Fit is to optimize memory usage, although it
+ * might introduce quite a bit of fragmentation, by leaving a large
+ * amount of small free areas.
+ *
+ * This is easily implemented by sorting the nodes array in
+ * ascending order (of free memory on each node) and then
+ * asking First Fit to run on the now-ordered data.
+ */
+static int place_domain_bfit(const libxl_numainfo *numa, uint64_t memb,
+                             int retry, int nr_nodes, int *nodes,
+                             libxl_nodemap *nodemap)
+{
+    return __place_domain_bw_fit(numa, memb, retry, nr_nodes,
+                                 nodes, nodemap, reorderer_cmp_asc);
+}
+
+/* Worst Fit automatic placement. It will (try to) find the node(s)
+ * with the highest possible amount of free memory that also satisfies
+ * domain memory requirements (@memb).
+ *
+ * The idea behind Worst Fit is that it will leave big enough free
+ * memory areas to limit the amount of fragmentation, especially
+ * compared to Best Fit.
+ *
+ * This is easily implemented by sorting the nodes array in
+ * ascending order (of free memory on each node) and then
+ * asking First Fit to run on the now-ordered data.
+ */
+static int place_domain_wfit(const libxl_numainfo *numa, uint64_t memb,
+                             int retry, int nr_nodes, int *nodes,
+                             libxl_nodemap *nodemap)
+{
+    return __place_domain_bw_fit(numa, memb, retry, nr_nodes,
+                                 nodes, nodemap, reorderer_cmp_desc);
+}
+
 /* Try to achieve optimal node-affinity for the domain */
 static int place_domain(libxl_domain_build_info *b_info)
 {
@@ -804,6 +944,16 @@ static int place_domain(libxl_domain_bui
                                    retry_policy, nr_nodes, &use_nodes,
                                    &new_nodemap);
             break;
+        case NODES_POLICY_BFIT:
+            rc = place_domain_bfit(numa, (uint64_t) need_memkb * 1024,
+                                   retry_policy, nr_nodes, &use_nodes,
+                                   &new_nodemap);
+            break;
+        case NODES_POLICY_WFIT:
+            rc = place_domain_wfit(numa, (uint64_t) need_memkb * 1024,
+                                   retry_policy, nr_nodes, &use_nodes,
+                                   &new_nodemap);
+            break;
         }
 
         if (rc)

Dario Faggioli

2012-Apr-11 13:17 UTC

head link

[PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

Add some rationale and usage documentation for the new automatic
NUMA placement feature of xl.

TODO: * Decide whether we want to have things like "Future
Steps/Roadmap"
        and/or "Performances/Benchmarks Results" here as well.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>

diff --git a/docs/misc/xl-numa-placement.txt b/docs/misc/xl-numa-placement.txt
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.txt
@@ -0,0 +1,205 @@
+               -------------------------------------
+               NUMA Guest Placement Design and Usage
+               -------------------------------------
+
+Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
+example each domain has its own "node affinity", i.e., a set of NUMA
nodes
+of the host from which memory for that domain is allocated. That becomes
+very important as soon as many domains start running memory-intensive
+workloads on a shared host. In fact, accessing non node-local memory
+locations costs much more than node-local ones, to the point that the
+degradation in performance is likely to be noticeable.
+
+It is then quite obvious that, any mechanism that enable the most of the
+memory accesses for the most of the most of the guest domains to stay
+local is something very important to achieve when dealing with NUMA
+platforms.
+
+
+Node Affinity and CPU Affinity
+------------------------------
+
+There is another very popular ''affinity'', besides node
affinity we are
+discussing here, which is ''(v)cpu affinity''. Moreover, to
make things
+even worse, the two are different but somehow related things. In fact,
+in both Xen and Linux worlds, ''cpu affinity'' is the set of
CPUs a domain
+(that would be a task, when talking about Linux) can be scheduled on.
+This seems to have few to do with memory accesses, but it does, as the
+CPU where a domain run is also from where it tries to access its memory,
+i.e., that is one half of what decides whether a memory access is remote
+or local --- the other half being where the location it wants to access
+is stored.
+
+Of course, if a domain is known to only run on a subset of the physical
+CPUs of the host, it is very easy to turn all its memory accesses into
+local ones, by just constructing it''s node affinity (in Xen) basing on
+what nodes these CPUs belongs to. Actually, that is exactly what is being
+done by the hypervisor by default, as soon as it finds out a domain (or
+better, the vcpus of a domain, but let''s avoid getting into too much
+details here) has a cpu affinity.
+
+This is working quite well, but it requires the user/system administrator
+to explicitly specify such property --- the cpu affinity --- while the
+domain is being created, or Xen won''t be able to exploit that for
ensuring
+accesses locality.
+
+On the other hand, as node affinity directly affects where domain''s
memory
+lives, it makes a lot of sense for it to be involved in scheduling decisions,
+as it would be great if the hypervisor would manage in scheduling all the
+vcpus of all the domains on CPUs attached to the various domains''
local
+memory. That is why, the node affinity of a domain is treated by the scheduler
+as the set of nodes on which it would be preferable to run it, although
+not at the cost of violating the scheduling algorithm behavior and
+invariants. This means it Xen will check whether a vcpu of a domain can run
+on one of the CPUs belonging to the nodes of the domain''s node
affinity,
+but will better run it somewhere else --- even on another, remote, CPU ---
+than violating the priority ordering (e.g., by kicking out from there another
+running vcpu with higher priority) it is designed to enforce.
+
+So, last but not least, what if a domain has both vcpu and node affinity, and
+they only partially match or they do not match at all (to understand how that
+can happen, see the following sections)? Well, in such case, all the domain
+memory will be allocated reflecting its node affinity, while scheduling will
+happen according to its vcpu affinities, meaning that it is easy enough to
+construct optimal, sub-optimal, neutral and even bad and awful configurations
+(which is something nice, e.g., for benchmarking purposes). The remainder
+part of this document is explaining how to do so.
+
+
+Specifying Node Affinity
+------------------------
+
+Besides being automatically computed from the vcpu affinities of a domain
+(or also from it being part of a cpupool) within Xen, it might make sense
+for the user to specify the node affinity of its domains by hand, while
+editing their config files, as another form of partitioning the host
+resources. If that is the case, this is where the "nodes" option of
the xl
+config file becomes useful. In fact, specifying something like the below
+
+        nodes = [ ''0'', ''1'',
''3'', ''4'' ]
+
+in a domain configuration file would result in Xen assigning host NUMA nodes
+number 0, 1, 3 and 4 to the domain''s node affinity, regardless of any
vcpu
+affinity setting for the same domain. The idea is, yes, the to things are
+related, and if only one is present, it makes sense to use the other for
+inferring it, but it is always possible to explicitly specify both of them,
+independently on how good or awful it could end up being.
+
+Therefore, this is what one should expect when using "nodes", perhaps
in
+conjunction with "cpus" in a domain configuration file:
+
+ * `cpus = "0, 1"` and no `nodes=` at all
+   (i.e., only vcpu affinity specified):
+     domain''s vcpus can and will run only on host CPUs 0 and 1. Also,
as
+     domain''s node affinity will be computed by Xen and set to
whatever
+     nodes host CPUs 0 and 1 belongs, all the domain''s memory accesses
will
+     be local accesses;
+
+ * `nodes = [ ''0'', ''1'' ]` and no `cpus=` at
all
+   (i.e., only node affinity present):
+     domain''s vcpus can run on any of the host CPUs, but the scheduler
(at
+     least if credit is used, as it is the only scheduler supporting this
+     right now) will try running them on the CPUs that are part of host
+     NUMA nodes 0 and 1. Memory-wise, all the domain''s memory will be
+     allocated on host NUMA nodes nodes 0 and 1. This means the most of
+     the memory accesses of the domain should be local, but that will
+     depend on the on-line load, behavior and actual scheduling of both
+     the domain in question and all the other domains on the same host;
+
+ * `nodes = [ ''0'', ''1'' ]` and `cpus =
"0"`, with CPU 0 within node 0:
+   (i.e., cpu affinity subset of node affinity):
+     domain''s vcpus can and will only run on host CPU 0. As node
affinity
+     is being explicitly set to host NUMA nodes 0 and 1 --- which includes
+     CPU 0 --- all the memory access of the domain will be local;
+
+ * `nodes = [ ''0'', ''1'' ]` and `cpus =
"0, 4", with CPU 0 in node 0 but
+   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node affinity):
+     domain''s vcpus can run on host CPUs 0 and 4, with CPU 4 not being
within
+     the node affinity (explicitly set to host NUMA nodes 0 and 1). The
+     (credit) scheduler will try to keep memory accesses local by scheduling
+     the domain''s vcpus on CPU 0, but it may not achieve 100% success;
+
+ * `nodes = [ ''0'', ''1'' ]` and `cpus =
"4"`, with CPU 4 within, say, node 2
+   (i.e., cpu affinity disjointed with node affinity):
+     domain''s vcpus can and will run only on host CPU 4, i.e.,
completely
+     "outside" of the chosen node affinity. That necessarily means
all the
+     domain''s memory access will be remote.
+
+
+Automatic NUMA Placement
+------------------------
+
+Just in case one does not want to take the burden of manually specifying
+all the node (and, perhaps, CPU) affinities for all its domains, xl implements
+some automatic placement logic. This basically means the user can ask the
+toolstack to try sorting things out in the best possible way for him.
+This is instead of specifying manually a domain''s node affinity and
can be
+paired or not with any vcpu affinity (in case it is, the relationship between
+vcpu and node affinities just stays as stated above). To serve this purpose,
+a new domain config switch has been introduces, i.e., the
"nodes_policy"
+option. As the name suggests, it allows for specifying a policy to be used
+while attempting automatic placement of the new domain. Available policies
+at the time of writing are:
+
+ * "auto": automatic placement by means of a not better specified (xl
+           implementation dependant) algorithm. It is basically for those
+           who do want automatic placement, but have no idea what policy
+           or algorithm would be better... <<Just give me a sane
default!>>
+
+ * "ffit": automatic placement via the First Fit algorithm, applied
checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
+
+ * "bfit": automatic placement via the Best Fit algorithm, applied
checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
+
+ * "wfit": automatic placement via the Worst Fit algorithm, applied
checking
+           the memory requirement of the domain against the amount of free
+           memory in the various host NUMA nodes;
+
+The various algorithms have been implemented as they offer different behavior
+and performances (for different performance metrics). For instance, First Fit
+is known to be efficient and quick, and it generally works better than Best
+Fit wrt memory fragmentation, although it tends to occupy "early"
nodes more
+than "late" ones. On the other hand, Best Fit aims at optimizing
memory usage,
+although it introduces quite a bit of fragmentation, by leaving large amounts
+of small free memory areas. Finally, the idea behind Worst Fit is that it will
+leave big enough free memory chunks to limit the amount of fragmentation, but
+it (as well as Best Fit does) is more expensive in terms of execution time, as
+it needs the "list" of free memory areas to be kept sorted.
+ 
+Therefore, achieving automatic placement actually happens by properly using
+the "nodes" and "nodes_config" configuration options as
follows:
+
+ * `nodes="auto` or `nodes_policy="auto"`:
+     xl will try fitting the domain on the host NUMA nodes by using its
+     own default placing algorithm, with default parameters. Most likely,
+     all nodes will be considered suitable for the domain (unless a vcpu
+     affinity is specified, see the last entry of this list;
+
+ * `nodes_policy="ffit"` (or `"bfit"`, `"wfit"`)
and no `nodes=` at all:
+     xl will try fitting the domain on the host NUMA nodes by using the
+     requested policy. All nodes will be considered suitable for the
+     domain, and consecutive fitting attempts will be performed while
+     increasing the number of nodes on which to put the domain itself
+     (unless a vcpu affinity is specified, see the last entry of this list);
+
+ * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`,
`"wfit"`) and `nodes=2`:
+     xl will try fitting the domain on the host NUMA nodes by using the
+     requested policy and only the number of nodes specified in `nodes=`
+     (2 in this example). All the nodes will be considered suitable for
+     the domain, and consecutive attempts will be performed while
+     increasing such a value;
+
+ * `nodes_policy="auto"` (or `"ffit"`, `"bfit"`,
`"wfit"`) and `cpus="0-6":
+     xl will try fitting the domain on the host NUMA nodes to which the CPUs
+     specified as vcpu affinity (0 to 6 in this example) belong, by using the
+     requested policy. In case it fails, consecutive fitting attempts will
+     be performed with both a reduced (first) and an increased (next) number
+     of nodes).
+
+Different usage patterns --- like specifying both a policy and a list of nodes
+are accepted, but does not make much sense after all. Therefore, although xl
+will try at its best to interpret user''s will, the resulting behavior
is
+somehow unspecified.

George Dunlap

2012-Apr-11 16:08 UTC

head link

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

On 11/04/12 14:17, Dario Faggioli wrote:> In preparation for adding an xc_nodemap_t and its related
> hadling logic. Just add one indirection layer, and try to
> retain the interface as much as possible (although some
> small bits here and there have been affected).This patch is fine with me on the whole (one comment below), but in this 
kind of a patch I think you need to include exactly what it is patch 
does.  I.e.:

1. Replace xenctl_cpumap with xenctl_map
2. Implement bitmap_to_xenctl_map and the reverse
3. Re-implement cpumask_to_xenctl_map with bitmap_to_xenctl_map and vice 
versa.
4. Other than #3, no functional changes.>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.eu.com>
>
> diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c
> --- a/tools/libxc/xc_cpupool.c
> +++ b/tools/libxc/xc_cpupool.c
> @@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_
>       sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO;
>       sysctl.u.cpupool_op.cpupool_id = poolid;
>       set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
> -    sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8;
> +    sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8;
>
>       err = do_sysctl_save(xch,&sysctl);
>
> @@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter
>       sysctl.cmd = XEN_SYSCTL_cpupool_op;
>       sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO;
>       set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local);
> -    sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8;
> +    sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8;
>
>       err = do_sysctl_save(xch,&sysctl);
>
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc
>
>       set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
>
> -    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
> +    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
>
>       ret = do_domctl(xch,&domctl);
>
> @@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc
>       domctl.u.vcpuaffinity.vcpu = vcpu;
>
>       set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local);
> -    domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8;
> +    domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8;
>
>       ret = do_domctl(xch,&domctl);
>
> diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c
> --- a/tools/libxc/xc_tbuf.c
> +++ b/tools/libxc/xc_tbuf.c
> @@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x
>       bitmap_64_to_byte(bytemap,&mask64, sizeof (mask64) * 8);
>
>       set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap);
> -    sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8;
> +    sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8;
>
>       ret = do_sysctl(xch,&sysctl);
>
> diff --git a/xen/arch/x86/platform_hypercall.c
b/xen/arch/x86/platform_hypercall.c
> --- a/xen/arch/x86/platform_hypercall.c
> +++ b/xen/arch/x86/platform_hypercall.c
> @@ -365,7 +365,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe
>       {
>           uint32_t cpu;
>           uint64_t idletime, now = NOW();
> -        struct xenctl_cpumap ctlmap;
> +        struct xenctl_map ctlmap;
>           cpumask_var_t cpumap;
>           XEN_GUEST_HANDLE(uint8) cpumap_bitmap;
>           XEN_GUEST_HANDLE(uint64) idletimes;
> @@ -378,7 +378,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE(xe
>           if ( cpufreq_controller != FREQCTL_dom0_kernel )
>               break;
>
> -        ctlmap.nr_cpus  = op->u.getidletime.cpumap_nr_cpus;
> +        ctlmap.nr_elems  = op->u.getidletime.cpumap_nr_cpus;
>           guest_from_compat_handle(cpumap_bitmap,
>                                    op->u.getidletime.cpumap_bitmap);
>           ctlmap.bitmap.p = cpumap_bitmap.p; /* handle ->  handle_64
conversion */
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -31,28 +31,29 @@
>   static DEFINE_SPINLOCK(domctl_lock);
>   DEFINE_SPINLOCK(vcpu_alloc_lock);
>
> -int cpumask_to_xenctl_cpumap(
> -    struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask)
> +int bitmap_to_xenctl_map(struct xenctl_map *xenctl_map,
> +                         const unsigned long *bitmap,
> +                         unsigned int nbits)
>   {
>       unsigned int guest_bytes, copy_bytes, i;
>       uint8_t zero = 0;
>       int err = 0;
> -    uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
> +    uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8);
>
>       if ( !bytemap )
>           return -ENOMEM;
>
> -    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
> -    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
> +    guest_bytes = (xenctl_map->nr_elems + 7) / 8;
> +    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
>
> -    bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids);
> +    bitmap_long_to_byte(bytemap, bitmap, nbits);
>
>       if ( copy_bytes != 0 )
> -        if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes)
)
> +        if ( copy_to_guest(xenctl_map->bitmap, bytemap, copy_bytes) )
>               err = -EFAULT;
>
>       for ( i = copy_bytes; !err&&  i<  guest_bytes; i++ )
> -        if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i,&zero,
1) )
> +        if ( copy_to_guest_offset(xenctl_map->bitmap, i,&zero, 1) )
>               err = -EFAULT;
>
>       xfree(bytemap);
> @@ -60,36 +61,58 @@ int cpumask_to_xenctl_cpumap(
>       return err;
>   }
>
> -int xenctl_cpumap_to_cpumask(
> -    cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap)
> +int xenctl_map_to_bitmap(unsigned long *bitmap,
> +                         const struct xenctl_map *xenctl_map,
> +                         unsigned int nbits)
>   {
>       unsigned int guest_bytes, copy_bytes;
>       int err = 0;
> -    uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8);
> +    uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8);
>
>       if ( !bytemap )
>           return -ENOMEM;
>
> -    guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8;
> -    copy_bytes  = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8);
> +    guest_bytes = (xenctl_map->nr_elems + 7) / 8;
> +    copy_bytes  = min_t(unsigned int, guest_bytes, (nbits + 7) / 8);
>
>       if ( copy_bytes != 0 )
>       {
> -        if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap,
copy_bytes) )
> +        if ( copy_from_guest(bytemap, xenctl_map->bitmap, copy_bytes) )
>               err = -EFAULT;
> -        if ( (xenctl_cpumap->nr_cpus&  7)&& 
(guest_bytes<= sizeof(bytemap)) )
> -            bytemap[guest_bytes-1]&= ~(0xff<< 
(xenctl_cpumap->nr_cpus&  7));
> +        if ( (xenctl_map->nr_elems&  7)&& 
(guest_bytes<= sizeof(bytemap)) )
> +            bytemap[guest_bytes-1]&= ~(0xff<< 
(xenctl_map->nr_elems&  7));
>       }
>
> -    if ( err )
> -        /* nothing */;
> -    else if ( alloc_cpumask_var(cpumask) )
> -        bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids);
> +    if ( !err )
> +        bitmap_byte_to_long(bitmap, bytemap, nbits);
> +
> +    xfree(bytemap);
> +
> +    return err;
> +}
> +
> +int cpumask_to_xenctl_cpumap(struct xenctl_map *xenctl_cpumap,
> +                             const cpumask_t *cpumask)
> +{
> +    return bitmap_to_xenctl_map(xenctl_cpumap, cpumask_bits(cpumask),
> +                                nr_cpu_ids);
> +}
> +
> +int xenctl_cpumap_to_cpumask(cpumask_var_t *cpumask,
> +                             const struct xenctl_map *xenctl_cpumap)
> +{
> +    int err = 0;
> +
> +    if ( alloc_cpumask_var(cpumask) ) {
> +        err = xenctl_map_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap,
> +                                   nr_cpu_ids);
> +        /* In case of error, cleanup is up to us, as the caller
won''t care! */
> +        if ( err )
> +            free_cpumask_var(*cpumask);
> +    }
>       else
>           err = -ENOMEM;
>
> -    xfree(bytemap);
> -
>       return err;
>   }
>
> diff --git a/xen/include/public/arch-x86/xen-mca.h
b/xen/include/public/arch-x86/xen-mca.h
> --- a/xen/include/public/arch-x86/xen-mca.h
> +++ b/xen/include/public/arch-x86/xen-mca.h
> @@ -414,7 +414,7 @@ struct xen_mc_mceinject {
>
>   struct xen_mc_inject_v2 {
>   	uint32_t flags;
> -	struct xenctl_cpumap cpumap;
> +	struct xenctl_map cpumap;
>   };
>   #endif
>
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -283,7 +283,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc
>   /* XEN_DOMCTL_getvcpuaffinity */
>   struct xen_domctl_vcpuaffinity {
>       uint32_t  vcpu;              /* IN */
> -    struct xenctl_cpumap cpumap; /* IN/OUT */
> +    struct xenctl_map cpumap;    /* IN/OUT */
>   };
>   typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t);
> diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h
> --- a/xen/include/public/sysctl.h
> +++ b/xen/include/public/sysctl.h
> @@ -71,8 +71,8 @@ struct xen_sysctl_tbuf_op {
>   #define XEN_SYSCTL_TBUFOP_disable      5
>       uint32_t cmd;
>       /* IN/OUT variables */
> -    struct xenctl_cpumap cpu_mask;
> -    uint32_t             evt_mask;
> +    struct xenctl_map cpu_mask;
> +    uint32_t          evt_mask;
>       /* OUT variables */
>       uint64_aligned_t buffer_mfn;
>       uint32_t size;  /* Also an IN variable! */
> @@ -531,7 +531,7 @@ struct xen_sysctl_cpupool_op {
>       uint32_t domid;       /* IN: M              */
>       uint32_t cpu;         /* IN: AR             */
>       uint32_t n_dom;       /*            OUT: I  */
> -    struct xenctl_cpumap cpumap; /*     OUT: IF */
> +    struct xenctl_map cpumap;    /*     OUT: IF */
>   };
>   typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t;
>   DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t);
> diff --git a/xen/include/public/xen.h b/xen/include/public/xen.h
> --- a/xen/include/public/xen.h
> +++ b/xen/include/public/xen.h
> @@ -822,9 +822,9 @@ typedef uint8_t xen_domain_handle_t[16];
>   #endif
>
>   #ifndef __ASSEMBLY__
> -struct xenctl_cpumap {
> +struct xenctl_map {
>       XEN_GUEST_HANDLE_64(uint8) bitmap;
> -    uint32_t nr_cpus;
> +    uint32_t nr_elems;
>   };
>   #endif
>
> diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
> --- a/xen/include/xen/cpumask.h
> +++ b/xen/include/xen/cpumask.h
> @@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
>   #define for_each_present_cpu(cpu)  for_each_cpu(cpu,&cpu_present_map)
>
>   /* Copy to/from cpumap provided by control tools. */
> -struct xenctl_cpumap;
> -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *);
> -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap
*);
> +struct xenctl_map;
> +int cpumask_to_xenctl_cpumap(struct xenctl_map *, const cpumask_t *);
> +int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_map *);You should probably s/cpumap/map/; in the function names as
well.>
>   #endif /* __XEN_CPUMASK_H */
> diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
> --- a/xen/include/xlat.lst
> +++ b/xen/include/xlat.lst
> @@ -2,7 +2,7 @@
>   # ! - needs translation
>   # ? - needs checking
>   ?	dom0_vga_console_info		xen.h
> -?	xenctl_cpumap			xen.h
> +?	xenctl_map			xen.h
>   ?	mmu_update			xen.h
>   !	mmuext_op			xen.h
>   !	start_info			xen.h

Dario Faggioli

2012-Apr-11 16:31 UTC

head link

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

On Wed, 2012-04-11 at 17:08 +0100, George Dunlap wrote:> On 11/04/12 14:17, Dario Faggioli wrote:
> > In preparation for adding an xc_nodemap_t and its related
> > hadling logic. Just add one indirection layer, and try to
> > retain the interface as much as possible (although some
> > small bits here and there have been affected).
> This patch is fine with me on the whole (one comment below), but in this 
> kind of a patch I think you need to include exactly what it is patch 
> does.  I.e.:
> 
> 1. Replace xenctl_cpumap with xenctl_map
> 2. Implement bitmap_to_xenctl_map and the reverse
> 3. Re-implement cpumask_to_xenctl_map with bitmap_to_xenctl_map and vice 
> versa.
> 4. Other than #3, no functional changes.
>Ok, I can do that when reposting (and for other similar patches as
well).
> > diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h
> > --- a/xen/include/xen/cpumask.h
> > +++ b/xen/include/xen/cpumask.h
> > @@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map;
> >   #define for_each_present_cpu(cpu) 
for_each_cpu(cpu,&cpu_present_map)
> >
> >   /* Copy to/from cpumap provided by control tools. */
> > -struct xenctl_cpumap;
> > -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t
*);
> > -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct
xenctl_cpumap *);
> > +struct xenctl_map;
> > +int cpumask_to_xenctl_cpumap(struct xenctl_map *, const cpumask_t *);
> > +int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_map
*);
> You should probably s/cpumap/map/; in the function names as well.
>Well, I retained the old name as the callers expect to use these
functions as that old name says, i.e., passing a cpumap and wanting a
cpumask back (for the first one). Thus, I was trying to avoid affecting
callers, exploiting the fact that xenctl_cpumap is a typedef to
libxl_map after all.

However, if that is too ugly, I surely can change function names, as
well as provide one more layer of indirection (xenctl_cpumap_to_cpumask
calling xenctl_map_to_cpumask). Just let me know what you prefer. :-)

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

George Dunlap

2012-Apr-11 16:38 UTC

head link

Re: [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

On 11/04/12 14:17, Dario Faggioli wrote:> Exactly as we have xc_cpumap_t, something similar for representing
> NUMA nodes (i.e., xc_nodemap_t and nodemap_t) could be useful. The
> same applies to libxl_cpumap, which on its turn now has its
> node-related counterpart, libxl_nodemap.
>
> This is all in preparation for making it possible explicit
> specification of the `node affinity` of a guest.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>
> diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c
> --- a/tools/libxc/xc_misc.c
> +++ b/tools/libxc/xc_misc.c
> @@ -35,21 +35,49 @@ int xc_get_max_cpus(xc_interface *xch)
>       return max_cpus;
>   }
>
> +int xc_get_max_nodes(xc_interface *xch)
> +{
> +    static int max_nodes = 0;
> +    xc_physinfo_t physinfo;
> +
> +    if ( max_nodes )
> +        return max_nodes;
> +
> +    if ( !xc_physinfo(xch,&physinfo) )
> +        max_nodes = physinfo.max_node_id + 1;
> +
> +    return max_nodes;
> +}
> +
>   int xc_get_cpumap_size(xc_interface *xch)
>   {
>       return (xc_get_max_cpus(xch) + 7) / 8;
>   }
>
> -xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
> +int xc_get_nodemap_size(xc_interface *xch)
>   {
> -    int sz;
> +    return (xc_get_max_nodes(xch) + 7) / 8;
> +}
>
> -    sz = xc_get_cpumap_size(xch);
> +/* XXX See [*] below ... */
> +static xc_cpumap_t __xc_map_alloc(xc_interface *xch, int sz)
> +{
>       if (sz == 0)
>           return NULL;
>       return calloc(1, sz);
>   }
>
> +xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
> +{
> +    return __xc_map_alloc(xch, xc_get_cpumap_size(xch));
> +}
> +
> +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
> +{
> +    /* XXX [*] How bad is this? Ideas? */
> +    return (xc_nodemap_t) __xc_map_alloc(xch, xc_get_nodemap_size(xch));
> +}
> +I don''t think it''s incredibly terrible if it would buy us
something; but
I don''t really see that it does.  Two functions would be a lot more 
readable, IMHO.

Other than that, looks fine.>   int xc_readconsolering(xc_interface *xch,
>                          char *buffer,
>                          unsigned int *pnr_chars,
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -330,6 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch
>   xc_cpumap_t xc_cpumap_alloc(xc_interface *xch);
>
>   /*
> + * NODEMAP handling
> + */
> +typedef uint8_t *xc_nodemap_t;
> +
> +/* return maximum number of NUMA nodes the hypervisor supports */
> +int xc_get_max_nodes(xc_interface *xch);
> +
> +/* return array size for nodemap */
> +int xc_get_nodemap_size(xc_interface *xch);
> +
> +/* allocate a nodemap */
> +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch);
> +
> +/*
>    * DOMAIN DEBUGGING FUNCTIONS
>    */
>
> diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> --- a/tools/libxl/libxl_utils.c
> +++ b/tools/libxl/libxl_utils.c
> @@ -486,11 +486,27 @@ int libxl_cpumap_alloc(libxl_ctx *ctx, l
>       return libxl_map_alloc(ctx, cpumap, max_cpus);
>   }
>
> +int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap)
> +{
> +    int max_nodes;
> +
> +    max_nodes = libxl_get_max_nodes(ctx);
> +    if (max_nodes == 0)
> +        return ERROR_FAIL;
> +
> +    return libxl_map_alloc(ctx, nodemap, max_nodes);
> +}
> +
>   int libxl_get_max_cpus(libxl_ctx *ctx)
>   {
>       return xc_get_max_cpus(ctx->xch);
>   }
>
> +int libxl_get_max_nodes(libxl_ctx *ctx)
> +{
> +    return xc_get_max_nodes(ctx->xch);
> +}
> +
>   int libxl__enum_from_string(const libxl_enum_string_table *t,
>                               const char *s, int *e)
>   {
> diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
> --- a/tools/libxl/libxl_utils.h
> +++ b/tools/libxl/libxl_utils.h
> @@ -109,6 +109,35 @@ static inline int libxl_cpumap_cpu_valid
>   #define libxl_for_each_set_cpu(v, m) for (v = 0; v<  (m).size * 8;
v++) \
>                                                if
(libxl_cpumap_test(&(m), v))
>
> +int libxl_nodemap_alloc(libxl_ctx *ctx, libxl_nodemap *nodemap);
> +static inline int libxl_nodemap_test(libxl_nodemap *nodemap, int node)
> +{
> +    return libxl_map_test(nodemap, node);
> +}
> +static inline void libxl_nodemap_set(libxl_nodemap *nodemap, int node)
> +{
> +    libxl_map_set(nodemap, node);
> +}
> +static inline void libxl_nodemap_reset(libxl_nodemap *nodemap, int node)
> +{
> +    libxl_map_reset(nodemap, node);
> +}
> +static inline void libxl_nodemap_set_any(libxl_nodemap *nodemap)
> +{
> +    libxl_map_set_any(nodemap);
> +}
> +static inline void libxl_nodemap_set_none(libxl_nodemap *nodemap)
> +{
> +    libxl_map_set_none(nodemap);
> +}
> +static inline int libxl_nodemap_node_valid(libxl_nodemap *nodemap, int
node)
> +{
> +    return libxl_map_elem_valid(nodemap, node);
> +}
> +#define libxl_for_each_node(var, map) for (var = 0; var<  (map).size *
8; var++)
> +#define libxl_for_each_set_node(v, m) for (v = 0; v<  (m).size * 8;
v++) \
> +                                              if
(libxl_nodemap_test(&(m), v))
> +
>   static inline uint32_t libxl__sizekb_to_mb(uint32_t s) {
>       return (s + 1023) / 1024;
>   }
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -116,6 +116,14 @@ int xenctl_cpumap_to_cpumask(cpumask_var
>       return err;
>   }
>
> +int xenctl_nodemap_to_nodemask(nodemask_t *nodemask,
> +                               const struct xenctl_map *xenctl_nodemap)
> +{
> +    return xenctl_map_to_bitmap(nodes_addr(*nodemask),
> +                                xenctl_nodemap,
> +                                MAX_NUMNODES);
> +}
> +
>   static inline int is_free_domid(domid_t dom)
>   {
>       struct domain *d;

Dario Faggioli

2012-Apr-11 16:41 UTC

head link

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

On Wed, 2012-04-11 at 18:31 +0200, Dario Faggioli wrote:> > >   /* Copy to/from cpumap provided by control tools. */
> > > -struct xenctl_cpumap;
> > > -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const
cpumask_t *);
> > > -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct
xenctl_cpumap *);
> > > +struct xenctl_map;
> > > +int cpumask_to_xenctl_cpumap(struct xenctl_map *, const
cpumask_t *);
> > > +int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct
xenctl_map *);
> > You should probably s/cpumap/map/; in the function names as well.
> >
> Well, I retained the old name as the callers expect to use these
> functions as that old name says, i.e., passing a cpumap and wanting a
> cpumask back (for the first one). I obviously meant for the _second_ one here. :-P

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Dario Faggioli

2012-Apr-11 16:57 UTC

head link

Re: [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

On Wed, 2012-04-11 at 17:38 +0100, George Dunlap wrote:> > -    sz = xc_get_cpumap_size(xch);
> > +/* XXX See [*] below ... */
> > +static xc_cpumap_t __xc_map_alloc(xc_interface *xch, int sz)
> > +{
> >       if (sz == 0)
> >           return NULL;
> >       return calloc(1, sz);
> >   }
> >
> > +xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
> > +{
> > +    return __xc_map_alloc(xch, xc_get_cpumap_size(xch));
> > +}
> > +
> > +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
> > +{
> > +    /* XXX [*] How bad is this? Ideas? */
> > +    return (xc_nodemap_t) __xc_map_alloc(xch,
xc_get_nodemap_size(xch));
> > +}
> > +
> I don''t think it''s incredibly terrible if it would buy us
something; but
> I don''t really see that it does.  
>Fine. :-)
> Two functions would be a lot more 
> readable, IMHO.
> I''m not entirely sure I see what you mean with "two
functions". The main
issue here is __xc_map_alloc returning an xc_cpumap_t. BTW, reading your
comment it came to my mind that something like the below (removing
__xc_map_alloc, which after all, is very small!) would be better... Was
it this what you had in mind when talking about two functions?

xc_cpumap_t xc_cpumap_alloc(xc_interface *xch)
{
    if (sz == 0)
        return NULL;
    return calloc(1, sz);
}

xc_nodemap_t xc_nodemap_alloc(xc_interface *xch)
{
    if (sz == 0)
        return NULL;
    return calloc(1, sz);
}

> Other than that, looks fine.
>Thanks for looking into this so quickly! :-P

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

Ian Campbell

2012-Apr-12 09:11 UTC

head link

Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

On Wed, 2012-04-11 at 14:17 +0100, Dario Faggioli wrote:> Add some rationale and usage documentation for the new automatic
> NUMA placement feature of xl.
> 
> TODO: * Decide whether we want to have things like "Future
Steps/Roadmap"
>         and/or "Performances/Benchmarks Results" here as well.
I think these would be better in the list archives and on the wiki
respectively.
> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> 
> diff --git a/docs/misc/xl-numa-placement.txt
b/docs/misc/xl-numa-placement.txt
> new file mode 100644
> --- /dev/null
> +++ b/docs/misc/xl-numa-placement.txt
It looks like you are using something approximating markdown syntax
here, so you might as well name this xl-numa-placement.markdown and get
a .html version etc almost for free.
> @@ -0,0 +1,205 @@
> +               -------------------------------------
> +               NUMA Guest Placement Design and Usage
> +               -------------------------------------
> +
> +Xen deals with Non-Uniform Memory Access (NUMA) machines in many ways. For
> +example each domain has its own "node affinity", i.e., a set of
NUMA nodes
> +of the host from which memory for that domain is allocated. That becomes
> +very important as soon as many domains start running memory-intensive
> +workloads on a shared host. In fact, accessing non node-local memory
> +locations costs much more than node-local ones, to the point that the
> +degradation in performance is likely to be noticeable.
> +
> +It is then quite obvious that, any mechanism that enable the most of the
> +memory accesses for the most of the most of the guest domains to stay
> +local is something very important to achieve when dealing with NUMA
> +platforms.
> +
> +
> +Node Affinity and CPU Affinity
> +------------------------------
> +
> +There is another very popular ''affinity'', besides node
affinity we are
> +discussing here, which is ''(v)cpu affinity''. Moreover,
to make things
> +even worse, the two are different but somehow related things. In fact,
> +in both Xen and Linux worlds, ''cpu affinity'' is the set
of CPUs a domain
> +(that would be a task, when talking about Linux) can be scheduled on.
> +This seems to have few to do with memory accesses, but it does, as the
                      ^little
> +CPU where a domain run is also from where it tries to access its memory,
> +i.e., that is one half of what decides whether a memory access is remote
> +or local --- the other half being where the location it wants to access
> +is stored.
> +
> +Of course, if a domain is known to only run on a subset of the physical
> +CPUs of the host, it is very easy to turn all its memory accesses into
> +local ones, by just constructing it''s node affinity (in Xen)
basing on
                                                                ^based
> +what nodes these CPUs belongs to. Actually, that is exactly what is being
> +done by the hypervisor by default, as soon as it finds out a domain (or
> +better, the vcpus of a domain, but let''s avoid getting into too
much
> +details here) has a cpu affinity.
> +
> +This is working quite well, but it requires the user/system administrator
> +to explicitly specify such property --- the cpu affinity --- while the
> +domain is being created, or Xen won''t be able to exploit that for
ensuring
> +accesses locality.
> +
> +On the other hand, as node affinity directly affects where
domain''s memory
> +lives, it makes a lot of sense for it to be involved in scheduling
decisions,
> +as it would be great if the hypervisor would manage in scheduling all the
> +vcpus of all the domains on CPUs attached to the various domains''
local
> +memory. That is why, the node affinity of a domain is treated by the
scheduler
> +as the set of nodes on which it would be preferable to run it, although
> +not at the cost of violating the scheduling algorithm behavior and
> +invariants. This means it Xen will check whether a vcpu of a domain can
run
> +on one of the CPUs belonging to the nodes of the domain''s node
affinity,
> +but will better run it somewhere else --- even on another, remote, CPU ---
> +than violating the priority ordering (e.g., by kicking out from there
another
> +running vcpu with higher priority) it is designed to enforce.
> +
> +So, last but not least, what if a domain has both vcpu and node affinity,
and
> +they only partially match or they do not match at all (to understand how
that
> +can happen, see the following sections)? Well, in such case, all the
domain
> +memory will be allocated reflecting its node affinity, while scheduling
will
> +happen according to its vcpu affinities, meaning that it is easy enough to
> +construct optimal, sub-optimal, neutral and even bad and awful
configurations
> +(which is something nice, e.g., for benchmarking purposes). The remainder
> +part of this document is explaining how to do so.
> +
> +
> +Specifying Node Affinity
> +------------------------
> +
> +Besides being automatically computed from the vcpu affinities of a domain
> +(or also from it being part of a cpupool) within Xen, it might make sense
> +for the user to specify the node affinity of its domains by hand, while
> +editing their config files, as another form of partitioning the host
> +resources. If that is the case, this is where the "nodes" option
of the xl
> +config file becomes useful. In fact, specifying something like the below
> +
> +        nodes = [ ''0'', ''1'',
''3'', ''4'' ]
> +
> +in a domain configuration file would result in Xen assigning host NUMA
nodes
> +number 0, 1, 3 and 4 to the domain''s node affinity, regardless of
any vcpu
> +affinity setting for the same domain. The idea is, yes, the to things are
                                                               two
> +related, and if only one is present, it makes sense to use the other for
> +inferring it, but it is always possible to explicitly specify both of
them,
> +independently on how good or awful it could end up being.
> +
> +Therefore, this is what one should expect when using "nodes",
perhaps in
> +conjunction with "cpus" in a domain configuration file:
> +
> + * `cpus = "0, 1"` and no `nodes=` at all
> +   (i.e., only vcpu affinity specified):
> +     domain''s vcpus can and will run only on host CPUs 0 and 1.
Also, as
> +     domain''s node affinity will be computed by Xen and set to
whatever
> +     nodes host CPUs 0 and 1 belongs, all the domain''s memory
accesses will
> +     be local accesses;
> +
> + * `nodes = [ ''0'', ''1'' ]` and no
`cpus=` at all
> +   (i.e., only node affinity present):
> +     domain''s vcpus can run on any of the host CPUs, but the
scheduler (at
> +     least if credit is used, as it is the only scheduler supporting this
> +     right now) will try running them on the CPUs that are part of host
> +     NUMA nodes 0 and 1. Memory-wise, all the domain''s memory
will be
> +     allocated on host NUMA nodes nodes 0 and 1. This means the most of
> +     the memory accesses of the domain should be local, but that will
> +     depend on the on-line load, behavior and actual scheduling of both
> +     the domain in question and all the other domains on the same host;
> +
> + * `nodes = [ ''0'', ''1'' ]` and `cpus =
"0"`, with CPU 0 within node 0:
> +   (i.e., cpu affinity subset of node affinity):
> +     domain''s vcpus can and will only run on host CPU 0. As node
affinity
> +     is being explicitly set to host NUMA nodes 0 and 1 --- which includes
> +     CPU 0 --- all the memory access of the domain will be local;
In this case won''t some of (half?) the memory come from node 1 and
therefore be non-local to cpu 0?
> +
> + * `nodes = [ ''0'', ''1'' ]` and `cpus =
"0, 4", with CPU 0 in node 0 but
> +   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node affinity):
> +     domain''s vcpus can run on host CPUs 0 and 4, with CPU 4 not
being within
> +     the node affinity (explicitly set to host NUMA nodes 0 and 1). The
> +     (credit) scheduler will try to keep memory accesses local by
scheduling
> +     the domain''s vcpus on CPU 0, but it may not achieve 100%
success;
> +
> + * `nodes = [ ''0'', ''1'' ]` and `cpus =
"4"`, with CPU 4 within, say, node 2
These examples might be a little clearer if you defined up front what
the nodes and cpus were and then used that for all of them?
> +   (i.e., cpu affinity disjointed with node affinity):
> +     domain''s vcpus can and will run only on host CPU 4, i.e.,
completely
> +     "outside" of the chosen node affinity. That necessarily
means all the
> +     domain''s memory access will be remote.
> +
> +
> +Automatic NUMA Placement
> +------------------------
> +
> +Just in case one does not want to take the burden of manually specifying
> +all the node (and, perhaps, CPU) affinities for all its domains, xl
implements
> +some automatic placement logic. This basically means the user can ask the
> +toolstack to try sorting things out in the best possible way for him.
> +This is instead of specifying manually a domain''s node affinity
and can be
> +paired or not with any vcpu affinity (in case it is, the relationship
between
> +vcpu and node affinities just stays as stated above). To serve this
purpose,
> +a new domain config switch has been introduces, i.e., the
"nodes_policy"
> +option. As the name suggests, it allows for specifying a policy to be used
> +while attempting automatic placement of the new domain. Available policies
> +at the time of writing are:
A bunch of what follows would be good to have in the xl or xl.cfg man
pages too/instead. (I started with this docs patch so I haven''t
actually
looked at the earlier ones yet, perhaps this is already the case)
> +
> + * "auto": automatic placement by means of a not better
specified (xl
> +           implementation dependant) algorithm. It is basically for those
> +           who do want automatic placement, but have no idea what policy
> +           or algorithm would be better... <<Just give me a sane
default!>>
> +
> + * "ffit": automatic placement via the First Fit algorithm,
applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "bfit": automatic placement via the Best Fit algorithm,
applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> + * "wfit": automatic placement via the Worst Fit algorithm,
applied checking
> +           the memory requirement of the domain against the amount of free
> +           memory in the various host NUMA nodes;
> +
> +The various algorithms have been implemented as they offer different
behavior
> +and performances (for different performance metrics). For instance, First
Fit
> +is known to be efficient and quick, and it generally works better than
Best
> +Fit wrt memory fragmentation, although it tends to occupy
"early" nodes more
> +than "late" ones. On the other hand, Best Fit aims at optimizing
memory usage,
> +although it introduces quite a bit of fragmentation, by leaving large
amounts
> +of small free memory areas. Finally, the idea behind Worst Fit is that it
will
> +leave big enough free memory chunks to limit the amount of fragmentation,
but
> +it (as well as Best Fit does) is more expensive in terms of execution
time, as
> +it needs the "list" of free memory areas to be kept sorted.
> +
> +Therefore, achieving automatic placement actually happens by properly
using
> +the "nodes" and "nodes_config" configuration options
as follows:
> +
> + * `nodes="auto` or `nodes_policy="auto"`:
> +     xl will try fitting the domain on the host NUMA nodes by using its
> +     own default placing algorithm, with default parameters. Most likely,
> +     all nodes will be considered suitable for the domain (unless a vcpu
> +     affinity is specified, see the last entry of this list;
> +
> + * `nodes_policy="ffit"` (or `"bfit"`,
`"wfit"`) and no `nodes=` at all:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy. All nodes will be considered suitable for the
> +     domain, and consecutive fitting attempts will be performed while
> +     increasing the number of nodes on which to put the domain itself
> +     (unless a vcpu affinity is specified, see the last entry of this
list);
> +
> + * `nodes_policy="auto"` (or `"ffit"`,
`"bfit"`, `"wfit"`) and `nodes=2`:
> +     xl will try fitting the domain on the host NUMA nodes by using the
> +     requested policy and only the number of nodes specified in `nodes=`
> +     (2 in this example).
Number of nodes rather than specifically node 2? This is different to
the examples in the preceding section?
>  All the nodes will be considered suitable for
> +     the domain, and consecutive attempts will be performed while
> +     increasing such a value;
> +
> + * `nodes_policy="auto"` (or `"ffit"`,
`"bfit"`, `"wfit"`) and `cpus="0-6":
> +     xl will try fitting the domain on the host NUMA nodes to which the
CPUs
> +     specified as vcpu affinity (0 to 6 in this example) belong, by using
the
> +     requested policy. In case it fails, consecutive fitting attempts will
> +     be performed with both a reduced (first) and an increased (next)
number
> +     of nodes).
> +
> +Different usage patterns --- like specifying both a policy and a list of
nodes
> +are accepted, but does not make much sense after all. Therefore, although
xl
> +will try at its best to interpret user''s will, the resulting
behavior is
> +somehow unspecified.

George Dunlap

2012-Apr-12 10:24 UTC

head link

Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

On 11/04/12 14:17, Dario Faggioli wrote:> Let the user specify the NUMA node affinity of a guest via the
> ''nodes='' config switch. Explicitly listing the intended
target host
> nodes is required as of now.
>
> A valid setting will directly impact the node_affinity mask
> of the guest, i.e., it will change the behaviour of the low level
> memory allocator. On the other hand, this commit does not affect
> by any means how the guest''s vCPUs are scheduled on the
host''s
> pCPUs.I would probably break this into three separate patches, starting with 
the hypervisor, then libxc, then libxl, especially since they tend to 
have different maintainers and committers.>
> TODO: * Better investigate and test interactions with cpupools.
>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>
> diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
> --- a/docs/man/xl.cfg.pod.5
> +++ b/docs/man/xl.cfg.pod.5
> @@ -112,6 +112,15 @@ List of which cpus the guest is allowed
>   (all vcpus will run on cpus 0,2,3,5), or `cpus=["2",
"3"]` (all vcpus
>   will run on cpus 2 and 3).
>
> +=item B<nodes=[ NODE, NODE, ...]>
> +
> +List of the NUMA nodes of the host the guest should be considered
> +`affine` with. Being affine to a (set of) NUMA node(s) mainly means
> +the guest''s memory is going to be allocated on those node(s).
> +
> +A list of nodes should be specified as follows: `nodes=["0",
"3"]`
> +for the guest to be affine with nodes 0 and 3.
> +Hmm, I think that using "affine" here is technically correct, and is 
what one would use if writing a research paper; but it''s unusual to
hear
the word in more common English; it would be more common to hear someone 
describe a VM as "having affinity with".  How about something like
this:

"The list of NUMA nodes the domain is considered to have affinity with.  
Memory from the guest will be allocated from these nodes."

(Technically you''re also not supposed to end a sentence with a 
preposition, but I think "...with which the domain is considered to have 
affinity" is just to awkward.)

Also, is there a way to specify that the affinity to be to all nodes 
and/or based on the vcpu mask of the vcpus?>   =item B<memory=MBYTES>
>
>   Start the guest with MBYTES megabytes of RAM.
> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c
> --- a/tools/libxc/xc_domain.c
> +++ b/tools/libxc/xc_domain.c
> @@ -110,6 +110,44 @@ int xc_domain_shutdown(xc_interface *xch
>   }
>
>
> +int xc_domain_node_affinity(xc_interface *xch,
> +                            uint32_t domid,
> +                            xc_nodemap_t node_affinity)
> +{
> +    DECLARE_DOMCTL;
> +    DECLARE_HYPERCALL_BUFFER(uint8_t, local);
> +    int ret = -1;
> +    int nodesize;
> +
> +    nodesize = xc_get_nodemap_size(xch);
> +    if (!nodesize)
> +    {
> +        PERROR("Could not get number of nodes");
> +        goto out;
> +    }
> +
> +    local = xc_hypercall_buffer_alloc(xch, local, nodesize);
> +    if ( local == NULL )
> +    {
> +        PERROR("Could not allocate memory for domain_node_affinity
domctl hypercall");
> +        goto out;
> +    }
> +
> +    domctl.cmd = XEN_DOMCTL_node_affinity;
> +    domctl.domain = (domid_t)domid;
> +
> +    memcpy(local, node_affinity, nodesize);
> +    set_xen_guest_handle(domctl.u.node_affinity.nodemap.bitmap, local);
> +    domctl.u.node_affinity.nodemap.nr_elems = nodesize * 8;
> +
> +    ret = do_domctl(xch,&domctl);
> +
> +    xc_hypercall_buffer_free(xch, local);
> +
> + out:
> +    return ret;
> +}
> +
>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h
> --- a/tools/libxc/xenctrl.h
> +++ b/tools/libxc/xenctrl.h
> @@ -520,6 +520,19 @@ int xc_watchdog(xc_interface *xch,
>                  uint32_t id,
>                  uint32_t timeout);
>
> +/**
> + * This function explicitly sets the affinity of a domain with the
> + * host NUMA nodes.
> + *
> + * @parm xch a handle to an open hypervisor interface.
> + * @parm domid the domain id in which vcpus are to be created.
> + * @parm node_affinity the map of the affine nodes.
> + * @return 0 on success, -1 on failure.
> + */
> +int xc_domain_node_affinity(xc_interface *xch,
> +                            uint32_t domind,
> +                            xc_nodemap_t node_affinity);
> +Seems like it would be useful to have both a get and a
set.>   int xc_vcpu_setaffinity(xc_interface *xch,
>                           uint32_t domid,
>                           int vcpu,
> diff --git a/tools/libxl/gentest.py b/tools/libxl/gentest.py
> --- a/tools/libxl/gentest.py
> +++ b/tools/libxl/gentest.py
> @@ -20,7 +20,7 @@ def randomize_case(s):
>   def randomize_enum(e):
>       return random.choice([v.name for v in e.values])
>
> -handcoded = ["libxl_cpumap", "libxl_key_value_list",
> +handcoded = ["libxl_cpumap", "libxl_nodemap",
"libxl_key_value_list",
>                "libxl_cpuid_policy_list",
"libxl_file_reference",
>                "libxl_string_list"]
>
> @@ -119,6 +119,19 @@ static void libxl_cpumap_rand_init(libxl
>       }
>   }
>
> +static void libxl_nodemap_rand_init(libxl_nodemap *nodemap)
> +{
> +    int i;
> +    nodemap->size = rand() % 16;
> +    nodemap->map = calloc(nodemap->size, sizeof(*nodemap->map));
> +    libxl_for_each_node(i, *nodemap) {
> +        if (rand() % 2)
> +            libxl_nodemap_set(nodemap, i);
> +        else
> +            libxl_nodemap_reset(nodemap, i);
> +    }
> +}
> +
>   static void libxl_key_value_list_rand_init(libxl_key_value_list *pkvl)
>   {
>       int i, nr_kvp = rand() % 16;
> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c
> --- a/tools/libxl/libxl.c
> +++ b/tools/libxl/libxl.c
> @@ -3082,6 +3082,16 @@ int libxl_set_vcpuaffinity_all(libxl_ctx
>       return rc;
>   }
>
> +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
> +                            libxl_nodemap *nodemap)
> +{
> +    if (xc_domain_node_affinity(ctx->xch, domid, nodemap->map)) {
> +        LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node
affinity");
> +        return ERROR_FAIL;
> +    }
> +    return 0;
> +}
> +
>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap
*cpumap)
>   {
>       GC_INIT(ctx);
> diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h
> --- a/tools/libxl/libxl.h
> +++ b/tools/libxl/libxl.h
> @@ -727,6 +727,8 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct
>                              libxl_cpumap *cpumap);
>   int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid,
>                                  unsigned int max_vcpus, libxl_cpumap
*cpumap);
> +int libxl_set_node_affinity(libxl_ctx *ctx, uint32_t domid,
> +                            libxl_nodemap *nodemap);Hmm -- is there really no libxl_get_vcpuaffinity()?  That seems to be a 
missing component...>   int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_cpumap
*cpumap);
>
>   libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx);
> diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
> --- a/tools/libxl/libxl_create.c
> +++ b/tools/libxl/libxl_create.c
> @@ -121,6 +121,12 @@ int libxl__domain_build_info_setdefault(
>           libxl_cpumap_set_any(&b_info->cpumap);
>       }
>
> +    if (!b_info->nodemap.size) {
> +        if (libxl_nodemap_alloc(CTX,&b_info->nodemap))
> +            return ERROR_NOMEM;
> +        libxl_nodemap_set_none(&b_info->nodemap);
> +    }
> +
>       if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT)
>           b_info->max_memkb = 32 * 1024;
>       if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT)
> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
> --- a/tools/libxl/libxl_dom.c
> +++ b/tools/libxl/libxl_dom.c
> @@ -67,6 +67,7 @@ int libxl__build_pre(libxl__gc *gc, uint
>       char *xs_domid, *con_domid;
>       xc_domain_max_vcpus(ctx->xch, domid, info->max_vcpus);
>       libxl_set_vcpuaffinity_all(ctx, domid,
info->max_vcpus,&info->cpumap);
> +    libxl_set_node_affinity(ctx, domid,&info->nodemap);
>       xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb +
LIBXL_MAXMEM_CONSTANT);
>       if (info->type == LIBXL_DOMAIN_TYPE_PV)
>           xc_domain_set_memmap_limit(ctx->xch, domid,
> diff --git a/tools/libxl/libxl_json.c b/tools/libxl/libxl_json.c
> --- a/tools/libxl/libxl_json.c
> +++ b/tools/libxl/libxl_json.c
> @@ -119,6 +119,26 @@ out:
>       return s;
>   }
>
> +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand,
> +                                       libxl_nodemap *nodemap)
> +{
> +    yajl_gen_status s;
> +    int i;
> +
> +    s = yajl_gen_array_open(hand);
> +    if (s != yajl_gen_status_ok) goto out;
> +
> +    libxl_for_each_node(i, *nodemap) {
> +        if (libxl_nodemap_test(nodemap, i)) {
> +            s = yajl_gen_integer(hand, i);
> +            if (s != yajl_gen_status_ok) goto out;
> +        }
> +    }
> +    s = yajl_gen_array_close(hand);
> +out:
> +    return s;
> +}
> +
>   yajl_gen_status libxl_key_value_list_gen_json(yajl_gen hand,
>                                                 libxl_key_value_list *pkvl)
>   {
> diff --git a/tools/libxl/libxl_json.h b/tools/libxl/libxl_json.h
> --- a/tools/libxl/libxl_json.h
> +++ b/tools/libxl/libxl_json.h
> @@ -27,6 +27,7 @@ yajl_gen_status libxl_domid_gen_json(yaj
>   yajl_gen_status libxl_uuid_gen_json(yajl_gen hand, libxl_uuid *p);
>   yajl_gen_status libxl_mac_gen_json(yajl_gen hand, libxl_mac *p);
>   yajl_gen_status libxl_cpumap_gen_json(yajl_gen hand, libxl_cpumap *p);
> +yajl_gen_status libxl_nodemap_gen_json(yajl_gen hand, libxl_nodemap *p);
>   yajl_gen_status libxl_cpuid_policy_list_gen_json(yajl_gen hand,
>                                                    libxl_cpuid_policy_list
*p);
>   yajl_gen_status libxl_string_list_gen_json(yajl_gen hand,
libxl_string_list *p);
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -11,6 +11,7 @@ libxl_domid = Builtin("domid", json_fn >  
libxl_uuid = Builtin("uuid", passby=PASS_BY_REFERENCE)
>   libxl_mac = Builtin("mac", passby=PASS_BY_REFERENCE)
>   libxl_cpumap = Builtin("cpumap",
dispose_fn="libxl_cpumap_dispose", passby=PASS_BY_REFERENCE)
> +libxl_nodemap = Builtin("nodemap",
dispose_fn="libxl_nodemap_dispose", passby=PASS_BY_REFERENCE)
>   libxl_cpuid_policy_list = Builtin("cpuid_policy_list",
dispose_fn="libxl_cpuid_dispose", passby=PASS_BY_REFERENCE)
>
>   libxl_string_list = Builtin("string_list",
dispose_fn="libxl_string_list_dispose", passby=PASS_BY_REFERENCE)
> @@ -233,6 +234,7 @@ libxl_domain_build_info = Struct("domain
>       ("max_vcpus",       integer),
>       ("cur_vcpus",       integer),
>       ("cpumap",          libxl_cpumap),
> +    ("nodemap",         libxl_nodemap),
>       ("tsc_mode",        libxl_tsc_mode),
>       ("max_memkb",       MemKB),
>       ("target_memkb",    MemKB),
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -515,7 +515,7 @@ static void parse_config_data(const char
>       const char *buf;
>       long l;
>       XLU_Config *config;
> -    XLU_ConfigList *cpus, *vbds, *nics, *pcis, *cvfbs, *cpuids;
> +    XLU_ConfigList *cpus, *nodes, *vbds, *nics, *pcis, *cvfbs, *cpuids;
>       int pci_power_mgmt = 0;
>       int pci_msitranslate = 1;
>       int pci_permissive = 0;
> @@ -628,6 +628,39 @@ static void parse_config_data(const char
>           free(buf2);
>       }
>
> +    if (!xlu_cfg_get_list (config, "nodes",&nodes, 0, 0)) {
> +        int i, n_nodes = 0;
> +
> +        if (libxl_nodemap_alloc(ctx,&b_info->nodemap)) {
> +            fprintf(stderr, "Unable to allocate nodemap\n");
> +            exit(1);
> +        }
> +
> +        libxl_nodemap_set_none(&b_info->nodemap);
> +        while ((buf = xlu_cfg_get_listitem(nodes, n_nodes)) != NULL) {
> +            i = atoi(buf);
> +            if (!libxl_nodemap_node_valid(&b_info->nodemap, i)) {
> +                fprintf(stderr, "node %d illegal\n", i);
> +                exit(1);
> +            }
> +            libxl_nodemap_set(&b_info->nodemap, i);
> +            n_nodes++;
> +        }
> +
> +        if (n_nodes == 0)
> +            fprintf(stderr, "No valid node found: no affinity will be
set\n");
> +    }
> +    else {
> +        /*
> +         * XXX What would a sane default be? I think doing nothing
> +         *     (i.e., relying on cpu-affinity/cpupool ==>  the current
> +         *     behavior) should be just fine.
> +         *     That would mean we''re saying to the user "if
you want us
> +         *     to take care of NUMA, please tell us, maybe just putting
> +         *     ''nodes=auto'', but tell us... otherwise we
do as usual".
> +         */I think that for this patch, doing nothing is fine (which means removing 
the whole else clause).  But once we have the auto placement, I think 
that "nodes=auto" should be the default.> +    }
> +
>       if (!xlu_cfg_get_long (config, "memory",&l, 0)) {
>           b_info->max_memkb = l * 1024;
>           b_info->target_memkb = b_info->max_memkb;
> diff --git a/tools/python/xen/lowlevel/xl/xl.c
b/tools/python/xen/lowlevel/xl/xl.c
> --- a/tools/python/xen/lowlevel/xl/xl.c
> +++ b/tools/python/xen/lowlevel/xl/xl.c
> @@ -243,6 +243,18 @@ int attrib__libxl_cpumap_set(PyObject *v
>       return 0;
>   }
>
> +int attrib__libxl_nodemap_set(PyObject *v, libxl_nodemap *pptr)
> +{
> +    int i;
> +    long node;
> +
> +    for (i = 0; i<  PyList_Size(v); i++) {
> +        node = PyInt_AsLong(PyList_GetItem(v, i));
> +        libxl_nodemap_set(pptr, node);
> +    }
> +    return 0;
> +}
> +
>   int attrib__libxl_file_reference_set(PyObject *v, libxl_file_reference
*pptr)
>   {
>       return genwrap__string_set(v,&pptr->path);
> diff --git a/xen/common/domain.c b/xen/common/domain.c
> --- a/xen/common/domain.c
> +++ b/xen/common/domain.c
> @@ -222,6 +222,7 @@ struct domain *domain_create(
>
>       spin_lock_init(&d->node_affinity_lock);
>       d->node_affinity = NODE_MASK_ALL;
> +    d->has_node_affinity = 0;
>
>       spin_lock_init(&d->shutdown_lock);
>       d->shutdown_code = -1;
> @@ -337,7 +338,7 @@ void domain_update_node_affinity(struct
>   {
>       cpumask_var_t cpumask;
>       cpumask_var_t online_affinity;
> -    const cpumask_t *online;
> +    const cpumask_t *online = cpupool_online_cpumask(d->cpupool);
>       nodemask_t nodemask = NODE_MASK_NONE;
>       struct vcpu *v;
>       unsigned int node;
> @@ -350,9 +351,22 @@ void domain_update_node_affinity(struct
>           return;
>       }
>
> -    online = cpupool_online_cpumask(d->cpupool);
> +    spin_lock(&d->node_affinity_lock);
>
> -    spin_lock(&d->node_affinity_lock);
> +    /*
> +     * If someone explicitly told us our NUMA affinity, avoid messing
> +     * that up. Notice, however, that vcpu affinity is still what
> +     * drives all scheduling decisions.
> +     *
> +     * XXX I''m quite sure this is fine wrt to the various
v->cpu_affinity
> +     *     (at least, that was the intended design!). Could it be useful
> +     *     to cross-check d->node_affinity against `online`? The basic
> +     *     idea here is "Hey, if you shoot yourself in the foot...
You''ve
> +     *     shot in the foot!", but, you know...
> +     */
> +    if ( d->has_node_affinity )
> +        goto out;
> +
>
>       for_each_vcpu ( d, v )
>       {
> @@ -365,6 +379,8 @@ void domain_update_node_affinity(struct
>               node_set(node, nodemask);
>
>       d->node_affinity = nodemask;
> +
> +out:
>       spin_unlock(&d->node_affinity_lock);
>
>       free_cpumask_var(online_affinity);
> @@ -372,6 +388,31 @@ void domain_update_node_affinity(struct
>   }
>
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
> +{
> +    spin_lock(&d->node_affinity_lock);
> +
> +    /*
> +     * Being/becoming affine to _no_ nodes is not going to work,
> +     * let''s take it as the `reset node affinity` command.
> +     */
> +    if ( nodes_empty(*affinity) )
> +    {
> +        d->has_node_affinity = 0;
> +        spin_unlock(&d->node_affinity_lock);
> +        domain_update_node_affinity(d);
> +        return 0;
> +    }
> +
> +    d->has_node_affinity = 1;
> +    d->node_affinity = *affinity;
> +
> +    spin_unlock(&d->node_affinity_lock);
> +
> +    return 0;
> +}
> +
> +
>   struct domain *get_domain_by_id(domid_t dom)
>   {
>       struct domain *d;
> diff --git a/xen/common/domctl.c b/xen/common/domctl.c
> --- a/xen/common/domctl.c
> +++ b/xen/common/domctl.c
> @@ -621,6 +621,27 @@ long do_domctl(XEN_GUEST_HANDLE(xen_domc
>       }
>       break;
>
> +    case XEN_DOMCTL_node_affinity:
> +    {
> +        domid_t dom = op->domain;
> +        struct domain *d = rcu_lock_domain_by_id(dom);
> +        nodemask_t new_affinity;
> +
> +        ret = -ESRCH;
> +        if ( d == NULL )
> +            break;
> +
> +        /* XXX We need an xsm_* for this I guess, right? */
Yes. :-)> +
> +        ret = xenctl_nodemap_to_nodemask(&new_affinity,
> +&op->u.node_affinity.nodemap);
> +        if ( !ret )
> +            ret = domain_set_node_affinity(d,&new_affinity);
> +
> +        rcu_unlock_domain(d);
> +    }
> +    break;
> +
>       case XEN_DOMCTL_setvcpuaffinity:
>       case XEN_DOMCTL_getvcpuaffinity:
>       {
> diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c
> --- a/xen/common/keyhandler.c
> +++ b/xen/common/keyhandler.c
> @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int
>       *set++ = ''\0'';
>   }
>
> +static void nodeset_print(char *set, int size, const nodemask_t *mask)
> +{
> +    *set++ = ''['';
> +    set += nodelist_scnprintf(set, size-2, mask);
> +    *set++ = '']'';
> +    *set++ = ''\0'';
> +}
> +
>   static void periodic_timer_print(char *str, int size, uint64_t period)
>   {
>       if ( period == 0 )
> @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k
>
>           dump_pageframe_info(d);
>
> +        nodeset_print(tmpstr, sizeof(tmpstr),&d->node_affinity);
> +        printk("NODE affinity for domain %d: %s\n",
d->domain_id, tmpstr);
> +
>           printk("VCPU information and callbacks for domain
%u:\n",
>                  d->domain_id);
>           for_each_vcpu ( d, v )
> diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
> --- a/xen/include/public/domctl.h
> +++ b/xen/include/public/domctl.h
> @@ -278,6 +278,15 @@ typedef struct xen_domctl_getvcpuinfo xe
>   DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t);
>
>
> +/* Set the NUMA node(s) with which the guest is `affine`. */
> +/* XEN_DOMCTL_node_affinity */
> +struct xen_domctl_node_affinity {
> +    struct xenctl_map nodemap;   /* IN */
> +};
> +typedef struct xen_domctl_node_affinity xen_domctl_node_affinity_t;
> +DEFINE_XEN_GUEST_HANDLE(xen_domctl_node_affinity_t);
> +
> +
>   /* Get/set which physical cpus a vcpu can execute on. */
>   /* XEN_DOMCTL_setvcpuaffinity */
>   /* XEN_DOMCTL_getvcpuaffinity */
> @@ -914,6 +923,7 @@ struct xen_domctl {
>   #define XEN_DOMCTL_set_access_required           64
>   #define XEN_DOMCTL_audit_p2m                     65
>   #define XEN_DOMCTL_set_virq_handler              66
> +#define XEN_DOMCTL_node_affinity                 67
>   #define XEN_DOMCTL_gdbsx_guestmemio            1000
>   #define XEN_DOMCTL_gdbsx_pausevcpu             1001
>   #define XEN_DOMCTL_gdbsx_unpausevcpu           1002
> @@ -927,6 +937,7 @@ struct xen_domctl {
>           struct xen_domctl_getpageframeinfo  getpageframeinfo;
>           struct xen_domctl_getpageframeinfo2 getpageframeinfo2;
>           struct xen_domctl_getpageframeinfo3 getpageframeinfo3;
> +        struct xen_domctl_node_affinity     node_affinity;
>           struct xen_domctl_vcpuaffinity      vcpuaffinity;
>           struct xen_domctl_shadow_op         shadow_op;
>           struct xen_domctl_max_mem           max_mem;
> diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h
> --- a/xen/include/xen/nodemask.h
> +++ b/xen/include/xen/nodemask.h
> @@ -8,8 +8,9 @@
>    * See detailed comments in the file linux/bitmap.h describing the
>    * data type on which these nodemasks are based.
>    *
> - * For details of nodemask_scnprintf() and nodemask_parse(),
> - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c.
> + * For details of nodemask_scnprintf(), nodelist_scnpintf() and
> + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse()
> + * in lib/bitmap.c.
>    *
>    * The available nodemask operations are:
>    *
> @@ -48,6 +49,7 @@
>    * unsigned long *nodes_addr(mask)     Array of unsigned long''s
in mask
>    *
>    * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing
> + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for
printing
>    * int nodemask_parse(ubuf, ulen, mask)        Parse ascii string as
nodemask
>    *
>    * for_each_node_mask(node, mask)      for-loop node over mask
> @@ -280,6 +282,14 @@ static inline int __first_unset_node(con
>
>   #define nodes_addr(src) ((src).bits)
>
> +#define nodelist_scnprintf(buf, len, src) \
> +                       __nodelist_scnprintf((buf), (len), (src),
MAX_NUMNODES)
> +static inline int __nodelist_scnprintf(char *buf, int len,
> +                                       const nodemask_t *srcp, int nbits)
> +{
> +       return bitmap_scnlistprintf(buf, len, srcp->bits, nbits);
> +}
> +
>   #if 0
>   #define nodemask_scnprintf(buf, len, src) \
>                          __nodemask_scnprintf((buf), (len),&(src),
MAX_NUMNODES)
> diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
> --- a/xen/include/xen/sched.h
> +++ b/xen/include/xen/sched.h
> @@ -346,8 +346,12 @@ struct domain
>       /* Various mem_events */
>       struct mem_event_per_domain *mem_event;
>
> -    /* Currently computed from union of all vcpu cpu-affinity masks. */
> +    /*
> +     * Can be specified by the user. If that is not the case, it is
> +     * computed from the union of all the vcpu cpu-affinity masks.
> +     */
>       nodemask_t node_affinity;
> +    int has_node_affinity;There''s something that seems a bit clunky about this; but I
can''t really
think of a better way.  At every least I''d change the name of the 
element here to something more descriptive; perhaps
"auto_node_affinity"
(which would invert the meaning)?>       unsigned int last_alloc_node;
>       spinlock_t node_affinity_lock;
>   };
> @@ -416,6 +420,7 @@ static inline void get_knownalive_domain
>       ASSERT(!(atomic_read(&d->refcnt)&  DOMAIN_DESTROYED));
>   }
>
> +int domain_set_node_affinity(struct domain *d, const nodemask_t
*affinity);
>   void domain_update_node_affinity(struct domain *d);
>
>   struct domain *domain_create(

George Dunlap

2012-Apr-12 10:29 UTC

head link

Re: [PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line

On 11/04/12 14:17, Dario Faggioli wrote:> For feature parity with vcpu affinity, allow for specifying
> node affinity not only at domain creation time, but at run-time
> too.
>
> Of course this is not going to be equally effective, as it will
> only affect future memory allocations without touching what''s
> already there. However, in future we might want to change this,
> and use this as an interface for sort-of manual "domain node
> migration".I think this is a good idea.

I think if you take my suggestion to rework the previous patch into 
hypervisor, libxc, and libxl components, you should also move the xl 
stuff out of that patch and into this one (so that the xl and libxl 
changes relating to node affinity are each in their own
patches).>
> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com>
>
> diff --git a/docs/man/xl.pod.1 b/docs/man/xl.pod.1
> --- a/docs/man/xl.pod.1
> +++ b/docs/man/xl.pod.1
> @@ -556,6 +556,26 @@ different run state is appropriate.  Pin
>   this, by ensuring certain VCPUs can only run on certain physical
>   CPUs.
>
> +=item B<node-affinity>  I<domain-id>  I<nodes>
> +
> +Set the NUMA node affinity for the domain, i.e., the set of NUMA
> +nodes of the host from which the memory of the domain will be
> +allocated. More specificallly, the domain''s memory will be split
> +in equal (well, as equal as possible) parts among all the nodes
> +it is affine with, The keyword B<all>  can be used to have the
> +domain affine to all NUMA nodes in the host.
> +
> +Normally NUMA node affinity of a domain is automatically computed
> +from its VCPU affinity. The default behaviour is to have it equal
> +to all the nodes the PCPUs onto which the VCPUs of the domain are
> +pinned belong to. Manually specifying it can be used to restrict
> +this to a specific subset of the host NUMA nodes, for improved
> +locality of memory accesses by the domain. Notice, however, that
> +this will not affect the memory that has already been allocated.
> +For having the full amount of memory allocated on specific node(s)
> +at domain creation time, the domain''s configuration file is what
> +should be used.
> +
>   =item B<vncviewer>  [I<OPTIONS>] I<domain-id>
>
>   Attach to domain''s VNC server, forking a vncviewer process.
> diff --git a/tools/libxl/libxl_utils.c b/tools/libxl/libxl_utils.c
> --- a/tools/libxl/libxl_utils.c
> +++ b/tools/libxl/libxl_utils.c
> @@ -442,7 +442,7 @@ void libxl_map_dispose(struct libxl_map
>       free(map->map);
>   }
>
> -static int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int
n_elems)
> +int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems)
>   {
>       int sz;
>
> diff --git a/tools/libxl/libxl_utils.h b/tools/libxl/libxl_utils.h
> --- a/tools/libxl/libxl_utils.h
> +++ b/tools/libxl/libxl_utils.h
> @@ -64,6 +64,7 @@ int libxl_devid_to_device_nic(libxl_ctx
>   int libxl_vdev_to_device_disk(libxl_ctx *ctx, uint32_t domid, const char
*vdev,
>                                  libxl_device_disk *disk);
>
> +int libxl_map_alloc(libxl_ctx *ctx, struct libxl_map *map, int n_elems);
>   int libxl_map_test(struct libxl_map *map, int elem);
>   void libxl_map_set(struct libxl_map *map, int elem);
>   void libxl_map_reset(struct libxl_map *map, int elem);
> @@ -79,6 +80,10 @@ static inline int libxl_map_elem_valid(s
>   {
>       return elem>= 0&&  elem<  (map->size * 8);
>   }
> +#define libxl_for_each_elem(v, m) for (v = 0; v<  (m).size * 8; v++)
> +#define libxl_for_each_set_elem(v, m) for (v = 0; v<  (m).size * 8;
v++) \
> +                                              if (libxl_map_test(&(m),
v))
> +
>
>   int libxl_cpumap_alloc(libxl_ctx *ctx, libxl_cpumap *cpumap);
>   static inline int libxl_cpumap_test(libxl_cpumap *cpumap, int cpu)
> diff --git a/tools/libxl/xl.h b/tools/libxl/xl.h
> --- a/tools/libxl/xl.h
> +++ b/tools/libxl/xl.h
> @@ -54,6 +54,7 @@ int main_config_update(int argc, char **
>   int main_button_press(int argc, char **argv);
>   int main_vcpupin(int argc, char **argv);
>   int main_vcpuset(int argc, char **argv);
> +int main_nodeaffinity(int argc, char **argv);
>   int main_memmax(int argc, char **argv);
>   int main_memset(int argc, char **argv);
>   int main_sched_credit(int argc, char **argv);
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -448,65 +448,75 @@ static void split_string_into_string_lis
>       free(s);
>   }
>
> -static int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
> +static int affinity_parse(char *str, struct libxl_map *map, int n_elems)
>   {
> -    libxl_cpumap exclude_cpumap;
> -    uint32_t cpuida, cpuidb;
> +    struct libxl_map exclude_map;
> +    uint32_t stra, strb;
>       char *endptr, *toka, *tokb, *saveptr = NULL;
> -    int i, rc = 0, rmcpu;
> -
> -    if (!strcmp(cpu, "all")) {
> -        libxl_cpumap_set_any(cpumap);
> +    int i, rc = 0, rmelem;
> +
> +    if (!strcmp(str, "all")) {
> +        libxl_map_set_any(map);
>           return 0;
>       }
>
> -    if (libxl_cpumap_alloc(ctx,&exclude_cpumap)) {
> -        fprintf(stderr, "Error: Failed to allocate cpumap.\n");
> +    if (libxl_map_alloc(ctx,&exclude_map, n_elems)) {
> +        fprintf(stderr, "Error: Failed to allocate
libxl_map.\n");
>           return ENOMEM;
>       }
>
> -    for (toka = strtok_r(cpu, ",",&saveptr); toka;
> +    for (toka = strtok_r(str, ",",&saveptr); toka;
>            toka = strtok_r(NULL, ",",&saveptr)) {
> -        rmcpu = 0;
> +        rmelem = 0;
>           if (*toka == ''^'') {
>               /* This (These) Cpu(s) will be removed from the map */
>               toka++;
> -            rmcpu = 1;
> +            rmelem = 1;
>           }
>           /* Extract a valid (range of) cpu(s) */
> -        cpuida = cpuidb = strtoul(toka,&endptr, 10);
> +        stra = strb = strtoul(toka,&endptr, 10);
>           if (endptr == toka) {
>               fprintf(stderr, "Error: Invalid argument.\n");
>               rc = EINVAL;
> -            goto vcpp_out;
> +            goto afp_out;
>           }
>           if (*endptr == ''-'') {
>               tokb = endptr + 1;
> -            cpuidb = strtoul(tokb,&endptr, 10);
> -            if (endptr == tokb || cpuida>  cpuidb) {
> +            strb = strtoul(tokb,&endptr, 10);
> +            if (endptr == tokb || stra>  strb) {
>                   fprintf(stderr, "Error: Invalid argument.\n");
>                   rc = EINVAL;
> -                goto vcpp_out;
> +                goto afp_out;
>               }
>           }
> -        while (cpuida<= cpuidb) {
> -            rmcpu == 0 ? libxl_cpumap_set(cpumap, cpuida) :
> -                         libxl_cpumap_set(&exclude_cpumap, cpuida);
> -            cpuida++;
> +        while (stra<= strb) {
> +            rmelem == 0 ? libxl_map_set(map, stra) :
> +                          libxl_map_set(&exclude_map, stra);
> +            stra++;
>           }
>       }
>
>       /* Clear all the cpus from the removal list */
> -    libxl_for_each_set_cpu(i, exclude_cpumap) {
> -        libxl_cpumap_reset(cpumap, i);
> -    }
> -
> -vcpp_out:
> -    libxl_cpumap_dispose(&exclude_cpumap);
> +    libxl_for_each_set_elem(i, exclude_map) {
> +        libxl_map_reset(map, i);
> +    }
> +
> +afp_out:
> +    libxl_map_dispose(&exclude_map);
>
>       return rc;
>   }
>
> +static inline int vcpupin_parse(char *cpu, libxl_cpumap *cpumap)
> +{
> +    return affinity_parse(cpu, cpumap, libxl_get_max_cpus(ctx));
> +}
> +
> +static inline int nodeaffinity_parse(char *nodes, libxl_nodemap *nodemap)
> +{
> +    return affinity_parse(nodes, nodemap, libxl_get_max_nodes(ctx));
> +}
> +
>   static void parse_config_data(const char *configfile_filename_report,
>                                 const char *configfile_data,
>                                 int configfile_len,
> @@ -3873,6 +3883,40 @@ int main_vcpuset(int argc, char **argv)
>       return 0;
>   }
>
> +static void nodeaffinity(const char *d, char *nodes)
> +{
> +    libxl_nodemap nodemap;
> +
> +    find_domain(d);
> +
> +    if (libxl_nodemap_alloc(ctx,&nodemap))
> +        goto nodeaffinity_out;
> +
> +    if (!strcmp(nodes, "all"))
> +        libxl_nodemap_set_any(&nodemap);
> +    else if (nodeaffinity_parse(nodes,&nodemap))
> +        goto nodeaffinity_out1;
> +
> +    if (libxl_set_node_affinity(ctx, domid,&nodemap) == -1)
> +        fprintf(stderr, "Could not set node affinity for dom
`%d''.\n", domid);
> +
> +  nodeaffinity_out1:
> +    libxl_nodemap_dispose(&nodemap);
> +  nodeaffinity_out:
> +    ;
> +}
> +
> +int main_nodeaffinity(int argc, char **argv)
> +{
> +    int opt;
> +
> +    if ((opt = def_getopt(argc, argv, "",
"node-affinity", 2)) != -1)
> +        return opt;
> +
> +    nodeaffinity(argv[optind], argv[optind+1]);
> +    return 0;
> +}
> +
>   static void output_xeninfo(void)
>   {
>       const libxl_version_info *info;
> diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c
> --- a/tools/libxl/xl_cmdtable.c
> +++ b/tools/libxl/xl_cmdtable.c
> @@ -195,6 +195,11 @@ struct cmd_spec cmd_table[] = {
>         "Set the number of active VCPUs allowed for the domain",
>         "<Domain>  <vCPUs>",
>       },
> +    { "node-affinity",
> +&main_nodeaffinity, 0,
> +      "Set the NUMA node affinity for the domain",
> +      "<Domain>  <Nodes|all>",
> +    },
>       { "list-vm",
>         &main_list_vm, 0,
>         "List the VMs,without DOM0",

Dario Faggioli

2012-Apr-12 10:32 UTC

head link

Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

On Thu, 2012-04-12 at 10:11 +0100, Ian Campbell wrote:> On Wed, 2012-04-11 at 14:17 +0100, Dario Faggioli wrote:
> > Add some rationale and usage documentation for the new automatic
> > NUMA placement feature of xl.
> > 
> > TODO: * Decide whether we want to have things like "Future
Steps/Roadmap"
> >         and/or "Performances/Benchmarks Results" here as
well.
> 
> I think these would be better in the list archives and on the wiki
> respectively.
> Ok, fine. I already posted the link in this thread and will continue to
do so, as I''ll put together a blog post and a wiki page about
benchmarks.

As for future steps/roadmap, let''s first see what comes out from this
series... :-)
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
> > 
> > diff --git a/docs/misc/xl-numa-placement.txt
b/docs/misc/xl-numa-placement.txt
> > new file mode 100644
> > --- /dev/null
> > +++ b/docs/misc/xl-numa-placement.txt
> 
> It looks like you are using something approximating markdown syntax
> here, so you might as well name this xl-numa-placement.markdown and get
> a .html version etc almost for free.
> Actually, that was another question I had and forgot to ask, i.e., what
format should this file come with. I sort of took inspiration from
xl-disk-configuration.txt and went for a plain text file, but I of
course can go for a full-fledged markdown syntax. Thanks.
> > +Of course, if a domain is known to only run on a subset of the
physical
> > +CPUs of the host, it is very easy to turn all its memory accesses
into
> > +local ones, by just constructing it''s node affinity (in Xen)
basing on
> 
>                                                                 ^based
> Ok, to this ans to all typos/english howlers as well. Thanks a lot for
looking into this! :-)
> > + * `nodes = [ ''0'', ''1'' ]` and
`cpus = "0"`, with CPU 0 within node 0:
> > +   (i.e., cpu affinity subset of node affinity):
> > +     domain''s vcpus can and will only run on host CPU 0. As
node affinity
> > +     is being explicitly set to host NUMA nodes 0 and 1 --- which
includes
> > +     CPU 0 --- all the memory access of the domain will be local;
> 
> In this case won''t some of (half?) the memory come from node 1 and
> therefore be non-local to cpu 0?
> Oops, yep, you''re right, that''s not what I meant to write!
> > +
> > + * `nodes = [ ''0'', ''1'' ]` and
`cpus = "0, 4", with CPU 0 in node 0 but
> > +   CPU 4 in, say, node 2 (i.e., cpu affinity superset of node
affinity):
> > +     domain''s vcpus can run on host CPUs 0 and 4, with CPU 4
not being within
> > +     the node affinity (explicitly set to host NUMA nodes 0 and 1).
The
> > +     (credit) scheduler will try to keep memory accesses local by
scheduling
> > +     the domain''s vcpus on CPU 0, but it may not achieve
100% success;
> > +
> > + * `nodes = [ ''0'', ''1'' ]` and
`cpus = "4"`, with CPU 4 within, say, node 2
> 
> These examples might be a little clearer if you defined up front what
> the nodes and cpus were and then used that for all of them?
> Good, idea, I will do that.
> A bunch of what follows would be good to have in the xl or xl.cfg man
> pages too/instead. (I started with this docs patch so I haven''t
actually
> looked at the earlier ones yet, perhaps this is already the case)
> Single patches that introduces the various features tries to document
them as well, but not with this level of details. I''m fine with putting
there whatever you think it could fit, just le me know, perhaps on the
comments on those patches, or whatever you like. 
> > +
> > + * "auto": automatic placement by means of a not better
specified (xl
> > +           implementation dependant) algorithm. It is basically for
those
> > +           who do want automatic placement, but have no idea what
policy
> > +           or algorithm would be better... <<Just give me a
sane default!>>
> > +
> > + * "ffit": automatic placement via the First Fit algorithm,
applied checking
> > +           the memory requirement of the domain against the amount of
free
> > +           memory in the various host NUMA nodes;
> > +
> > + * "bfit": automatic placement via the Best Fit algorithm,
applied checking
> > +           the memory requirement of the domain against the amount of
free
> > +           memory in the various host NUMA nodes;
> > +
> > + * "wfit": automatic placement via the Worst Fit algorithm,
applied checking
> > +           the memory requirement of the domain against the amount of
free
> > +           memory in the various host NUMA nodes;
> >
> > <snip>
> >
> > + * `nodes_policy="auto"` (or `"ffit"`,
`"bfit"`, `"wfit"`) and `nodes=2`:
> > +     xl will try fitting the domain on the host NUMA nodes by using
the
> > +     requested policy and only the number of nodes specified in
`nodes=`
> > +     (2 in this example).
> 
> Number of nodes rather than specifically node 2? This is different to
> the examples in the preceding section?
> It is. I''ll try to clarify things as per your suggestion. However,
talking about syntax, here''s what the series allows "nodes"
and
"nodes_policy" to be:

 * "nodes=": - a list (`[ ''0'', ''3''
]`), and in this case the elements
               of the list are specific nodes you want to use;
             - an integer (`2`), and in this case that is the _number_ 
               of nodes you want to use, with the algorithm free to
               arbitrary decide which ones to pick;
             - the string `"auto"`, and in this case you tell the 
               algorithm: <<please, do whatever you like and make me  
               happy>> :-)

 * "nodes_policy=" - the string `"auto"`, the same as above
                   - the strings `"ffit"`, `"bfit"` and
`"wfit"`, with
                     the meaning reported by the doc in he patch.

There is some overlapping but I wanted to make it possible for one to
write just things like:

nodes = [ ''0'', ''3'' ]

or: 

nodes = "auto"

or:

nodes_policy = "wfit"
nodes = 2

without introducing too much different options. On the down side, this
could obviously lead to awkward or nonsensical combinations... I tried
to intercept the worst of them during config file parsing, and can
surely push this farther.

So the important question here is, besides from the fact I''ll try to
clarify things better, do you think the interface is both comprehensive
and clear enough? Or should we think to something different?

Thanks a lot again and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

David Vrabel

2012-Apr-12 10:48 UTC

head link

Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

On 12/04/12 11:24, George Dunlap wrote:> On 11/04/12 14:17, Dario Faggioli wrote:
>> 
>> --- a/docs/man/xl.cfg.pod.5
>> +++ b/docs/man/xl.cfg.pod.5
>> @@ -112,6 +112,15 @@ List of which cpus the guest is allowed
>>   (all vcpus will run on cpus 0,2,3,5), or `cpus=["2",
"3"]` (all vcpus
>>   will run on cpus 2 and 3).
>>
>> +=item B<nodes=[ NODE, NODE, ...]>
>> +
>> +List of the NUMA nodes of the host the guest should be considered
>> +`affine` with. Being affine to a (set of) NUMA node(s) mainly means
>> +the guest''s memory is going to be allocated on those node(s).
>> +
>> +A list of nodes should be specified as follows: `nodes=["0",
"3"]`
>> +for the guest to be affine with nodes 0 and 3.
>> +
> Hmm, I think that using "affine" here is technically correct, and
is
> what one would use if writing a research paper; but it''s unusual
to hear
> the word in more common English; it would be more common to hear someone
> describe a VM as "having affinity with".  How about something
like this:
> 
> "The list of NUMA nodes the domain is considered to have affinity
with.
> Memory from the guest will be allocated from these nodes."
That''s clunky sentence, the "considered" doesn''t add
anything.  Or perhaps:

"The NUMA nodes preferred by the guest.  Memory for the guest will be
allocated on these nodes."

The need for quotes around the node numbers is odd.  Can that
requirement be removed?

David

Ian Jackson

2012-Apr-12 11:32 UTC

head link

Formatting of emails which are comments on patches.

Someone wrote:
) On 11/04/12 14:17, Dario Faggioli wrote:
) ) [9 lines of quoted prose]
) [3 lines of replying prose]
) )
) ) [1 line of quoted prose]
) )
) ) Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com)
) )
) ) [16 lines of diff, quoted without any elision]
) [4 lines of prose]
) 
) [2 lines of prose]
) 
) [3 lines of prose]
) 
) [2 lines of prose]
) ) [71 lines of diff, quoted without any elision]
) [1 line of prose]
) ) [64 lines of diff, quoted without any elision]

etc.  (I have replaced ">" with ")" to try to subvert
email programs
which do something special with quoted text so that you can see what
it looks like to me.)

This is unnecessarily hard to read.  What it looks like on my screen
can be seen here:
  http://www.chiark.greenend.org.uk/~ijackson/2012/email-formatting.png

Can I ask people to:

 * Trim diffs when replying to patches.  There is no need to quote the
   whole patch.  Just quote the parts which are needed to understand
   your comments.

 * Leave a blank line between quoted text of any kind and your own 
   added text.

That will help your comments stand out, and be read.  This is
particularly important if the quoted patches come to more than 80
columns, because then there are lots of breaks in the ">" down the
left hand side due to wrap damage and they are hard to tell from
interspersed comments.

Thanks and sorry to be picky.

Ian.

George Dunlap

2012-Apr-12 11:42 UTC

head link

Re: Formatting of emails which are comments on patches.

On 12/04/12 12:32, Ian Jackson wrote:> Can I ask people to:
>
>   * Trim diffs when replying to patches.  There is no need to quote the
>     whole patch.  Just quote the parts which are needed to understand
>     your comments.
>
>   * Leave a blank line between quoted text of any kind and your own
>     added text.
>
> That will help your comments stand out, and be read.  This is
> particularly important if the quoted patches come to more than 80
> columns, because then there are lots of breaks in the ">" down
the
> left hand side due to wrap damage and they are hard to tell from
> interspersed comments.
>
> Thanks and sorry to be picky.No problem -- sorry for not doing my own trimming; gmail does a good job 
trimming stuff automatically, so I forget what it looks like.

  -George

George Dunlap

2012-Apr-27 14:45 UTC

head link

Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`

On 11/04/12 14:17, Dario Faggioli wrote:> Basic idea is: cpu-affinity tells where a vcpu can run,
> while node-affinity tells where (in terms of on what NUMA nodes,
> but that of course imply a set of CPUs) all the vcpus of the
> domain should/prefer to run... Problems starts when you have
> both of them speaking at the same time!
>
> Respecting vcpu-affinity should remain the primary concern, thus
> what we do here is add some two-step logic here and there in the
> scheduler (i.e., where vcpu-affinity related decisions are being
> undertaken). During the first step, we check the `&&` of vcpu and
> node affinity masks, aiming at giving some precedence to the CPUs
> that are both suitable and preferrable for the domain. However,
> if that fails in finding a valid CPU, the node-affinity is just
> ignored and, in the second step, we just fallback to vcpu-affinity,
> as the original behaviour was.
>
> Both the introduced overhead and the benefits this provides has
> to be investigated and compared against each other, possibly in
> varying conditions and running different workloads.
>
> Finally, although still at the RFC level, efforts have been made
> to write the code in a flexible enough fashion so that, if we
> ever want to introduce further balancing steps, it shouldn''t be
> too much of a pain.
>
> TODO: * Better investigate and test interactions with cpupools.
>        * Test, verify and benchmark. Then test, verify and benchmark
>          again, and again and again. What I know as of know is that
>          it does not explode that easily, but whether it works properly
>          and, more important, brings any benefit, this has to be
>          proven (and yes, I''m out running these tests and benchs
already,
>          but do not esitate to manifest your will for helping with
>          that :-D).
>        * Add some counter/stats, e.g., to serve the purpose outlined
>          right above.
>
> XXX I''m benchmarking this just right now, and peeking at the
results
>      they don''t seem too bad. Numbers are mostly close to the case
where
>      the VM is already pinned to a node when created. I''ll post
the
>      results as soon as I can, and I''ll be happy to investigate
any issue,
>      if you feel like the approach could be the right one.Hey Dario,

Sorry for the long delay in reviewing this.

Overall I think the approach is good.  A few points:>   /*
> + * Node Balancing
> + */
> +#define CSCHED_BALANCE_NODE_AFFINITY    1
> +#define CSCHED_BALANCE_CPU_AFFINITY     0
> +#define CSCHED_BALANCE_START_STEP       CSCHED_BALANCE_NODE_AFFINITY
> +#define CSCHED_BALANCE_END_STEP         CSCHED_BALANCE_CPU_AFFINITY
> +
> +This thing of defining "START_STEP" and "END_STEP" seems a
bit fragile.
I think it would be better to always start at 0, and go until 
CSCHED_BALANCE_MAX.> +    /*
> +     * Let''s cache domain''s dom->node_affinity here
as an
> +     * optimization for a couple of hot paths. In fact,
> +     * knowing  whether or not dom->node_affinity has changed
> +     * would allow us to avoid rebuilding node_affinity_cpumask
> +     * (below) duing node balancing and/or scheduling.
> +     */
> +    nodemask_t node_affinity_cache;
> +    /* Basing on what dom->node_affinity says,
> +     * on what CPUs would we like to run most? */
> +    cpumask_t node_affinity_cpumask;I think the comments here need to be more clear.  The main points are:
* node_affinity_cpumask is the dom->node_affinity translated from a 
nodemask into a cpumask
* Because doing the nodemask -> cpumask translation may be expensive, 
node_affinity_cache stores the last translated value, so we can avoid 
doing the translation if nothing has changed.
> +#define csched_balance(__step) \
> +    for ( (__step) = CSCHED_BALANCE_START_STEP; \
> +          (__step)>= CSCHED_BALANCE_END_STEP; \
> +          (__step)-- )I don''t like this.  For one, it''s fragile; if for some reason
you
switched NODE_AFFINITY and CPU_AFFINITY above, suddenly this loop would 
go the wrong direction.

I don''t think there''s any reason not to just use
"for(step=0; step <
CSCHED_BALANCE_MAX; step++)" -- it''s not ugly and it means you
know
exactly what''s going on without having to go search for the macro.
> +
> +/*
> + * Sort-of conversion between node-affinity and vcpu-affinity for the
domain,
> + * i.e., a cpumask containing all the cpus from all the set nodes in the
> + * node-affinity mask of the domain.This needs to be clearer -- vcpu-affinity doesn''t have anything to do 
with this function, and there''s nothing "sort-of" about the
conversion. :-)

I think you mean to say, "Create a cpumask from the node affinity
mask."> + *
> + * Notice that we completely forget about serializing this (both here and
> + * in the various sites where node_affinity_cpumask is used) against
> + * d->node_affinity_lock. That would be hard to do properly, as that
lock
> + * is a non IRQ-safe one, and it would be nearly impossible to access it
> + * from the scheduling code. However, although this leaves some room for
> + * races with code paths that updates d->node_affinity, it all should
still
> + * be fine, considering:
> + *  (1) d->node_affinity updates are going to be quite rare;
> + *  (2) this balancing logic is kind-of "best effort" anyway;
> + *  (3) given (1), any inconsistencies are likely to be fixed by the next
> + *      call to this same function without risking going into instability.
> + *
> + * XXX Validate (via testing/benchmarking) whether this is true, as it
> + *     likely sounds to be, or it causes unexpected issues.
Ack.> +/* Put together the cpumask we are going to use for this balancing step */
> +static int
> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask)
> +{
> +    struct domain *d = vc->domain;
> +    struct csched_dom *sdom = CSCHED_DOM(d);
> +
> +    /*
> +     * Only two possible steps exists for now: node and vcpu affinity.
> +     * If this domain does not have a node-affinity or is affine to all
the
> +     * nodes, just return th vcpu-affinity mask (for *this* vcpu) and
> +     * stop the balancing process.
> +     */
> +    if ( !d->has_node_affinity || nodes_full(d->node_affinity) ||
> +         step == CSCHED_BALANCE_CPU_AFFINITY )
> +    {
> +        cpumask_copy(mask, vc->cpu_affinity);
> +        return CSCHED_BALANCE_CPU_AFFINITY;
> +    }Hmm -- this mechanism seems kind of fragile.  It seems like returning -1 
or something, and having the caller call "continue", would be easier
for
future coders (potentially you or I) to reason about.> +    /*
> +     * XXX As of now (with only two possible steps, this is easy and
readable
> +     *     enough. If more steps become necessary at some point in time,
we
> +     *     can keep an array of cpumask_t in dom_data and return the
proper
> +     *     element via step-indexing such an array. Of course, we can turn
> +     *     this like that even right now... Thoughts?
> +     */Beware of early optimization. :-)  Just make sure to mark this down for 
something to look at for profiling.>   static inline void
> +__cpumask_tickle(cpumask_t *mask, const cpumask_t *idle_mask)
> +{
> +    CSCHED_STAT_CRANK(tickle_idlers_some);
> +    if ( opt_tickle_one_idle )
> +    {
> +        this_cpu(last_tickle_cpu) > +           
cpumask_cycle(this_cpu(last_tickle_cpu), idle_mask);
> +        cpumask_set_cpu(this_cpu(last_tickle_cpu), mask);
> +    }
> +    else
> +        cpumask_or(mask, mask, idle_mask);
> +}I don''t see any reason to make this into a function -- it''s
only called
once, and it''s not that long.  Unless you''re concerned about
too many
indentations making the lines too short?> +            csched_balance(balance_step)
>               {
Again, I''d make this a for() loop.>      sdom->dom = dom;
> +    /*
> +     *XXX This would be ''The Right Thing'', but as it is
still too
> +     *    early and d->node_affinity has not settled yet, maybe we
> +     *    can just init the two masks with something like all-nodes
> +     *    and all-cpus and rely on the first balancing call for
> +     *    having them updated?
> +     */
> +    csched_build_balance_cpumask(sdom);We might as well do what you''ve got here, unless it''s likely
to produce
garbage.  This isn''t exactly a hot path. :-)

Other than that, looks good -- Thanks!

  -George

George Dunlap

2012-May-01 15:45 UTC

head link

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Hey Dario,

Thanks for the work on this -- this is obviously a very complicated area 
to try to make sense of.  Of course, that makes it even more complicated 
to review -- sorry this has taken so long.  (comments inline)

On 11/04/12 14:17, Dario Faggioli wrote:> +
> +=item B<auto>
> +
> +use a not better specified (xl-implementation dependant) algorithm
> +to try automatically fitting the guest on the host''s NUMA nodesUum, you mean, "Use the default placement algorithm"? :-)  We should 
specify which one this will be here.> +
> +/* Automatic placement policies symbols, with the following meaning:
> + *  NONE   : no automatic placement at all;
> + *  STATIC : explicit nodes specification.
> + *  FFIT   : use the First Fit algorithm for automatic placement;
> + *  AUTO   : an not better specified automatic placement, likely
> + *           to be implemented as a short circuit to (one of)
> + *           the above(s);
> + */
> +#define NODES_POLICY_NONE    0
> +#define NODES_POLICY_STATIC  1
> +#define NODES_POLICY_FFIT    2
> +#define NODES_POLICY_AUTO    3This is minor, but it seems like "auto" should be 1, so if we add 
another policy, the policies are all together, without having to move 
AUTO around every time.> +
> +/* Store the policy for the domain while parsing */
> +static int nodes_policy = NODES_POLICY_DEFAULT;
> +
> +/* Store the number of nodes to be used while parsing */
> +static int num_nodes_policy = 0;Why are "nodes_policy" and "num_nodes_policy" not passed in
along with
b_info?> +static int nodes_policy_parse(const char *pol, long int *policy)
> +{
> +    int i;
> +    const char *n;
> +
> +    for (i = 0; i<  sizeof(nodes_policy_names) /
sizeof(nodes_policy_names[0]); i++) {I personally prefer an explicit NODES_POLICY_MAX, but I''ll let the
libxl
maintainers decide on that.> +
> +        /* Determine how much free memory we want on each of the nodes
> +         * that will end up in suitable_nodes. Either adding or ignoring
> +         * the modulus of the integer division should be fine (the total
> +         * number of nodes should be in the order of tens of them), so
> +         * let''s add it as it should be more safe.
> +         */
> +        memb_node = (memb / (*nodes)) + (memb % (*nodes));Wouldn''t it just be easier to do a "round-up" function here,
like this:
  memb_node = ( (memb + *nodes -1) / (*nodes) )

Also, is it really necessary for a VM to have an equal amount of memory 
on every node? It seems like it would be better to have 75% on one node 
and 25% on a second node than to have 25% on four nodes, for example.  
Furthermore, insisting on an even amount fragments the node memory 
further -- i.e., if we chose to have 25% on four nodes instead of 75% on 
one and 25% on another, that will make it harder for another VM to fit 
on a single node as well.

The need for NODES_POLICY_RETRY_DEC is an artifact of the above; if 
nodes were allowed to be assymetric, you wouldn''t need to *decrease*
the
number of nodes to find *more* memory.> +        /* Apply the asked retry policy for the next round. Notice
> +         * that it would be pointless to increase nodes without also
> +         * adding some nodes to the map! */
> +        *nodes += retry;
> +        if (retry == NODES_POLICY_RETRY_INC)
> +            __add_nodes_to_nodemap(nodemap, numa, nr_nodes, retry);Hmm -- if I''m reading this right, the only time the nodemap
won''t be all
nodes is if (1) the user specified nodes, or (2) there''s a cpumask in 
effect.  If we''re going to override that setting, wouldn''t it
make sense
to just expand to all numa nodes?

Hmm -- though I suppose what you''d really want to try is adding each 
node in turn, rather than one at a time (i.e., if the cpus are pinned to 
nodes 2 and 3, and [2,3] doesn''t work, try [1,2,3] and [2,3,4] before 
trying [1,2,3,4].  But that''s starting to get really complicated -- I 
wonder if it''s better to just fail and let the user change the pinnings
/ configuration node mapping.> +
> +    /* If a nodemap has been specified, just comply with that.
> +     * OTOH, if there''s no nodemap, rely on cpumap (if any), and
> +     * fallback to all node if even that is empty. Also consider
> +     * all the nodes to be valid if only the number (i.e., _how_
> +     * _many_ of them instead of _which_ of them) of nodes the
> +     * domain wants is provided.
I feel like this comment could be made more clear...> +     *
> +     * Concering retries, if a specific set of nodes is specified,
> +     * just try that and give up if we fail. If, instead, a set of
> +     * CPUs onto which to pin the domain is specified, try first
> +     * the exact nodes those CPUs belongs to, then also try both
> +     * a smaller and a bigger number of nodes. The same happens if
> +     * we just have the number of nodes to be used. Finally, if
> +     * there is no node-affinity, no cpu-pinning, no number of nodes,
> +     * start trying with one node and increase at the configured
> +     * rate (considering all the nodes to be suitable).This should be above the "do {} while()" loop below.  (Also, see
comment
on increasing node map above.)> +
> +        if (use_cpus>= b_info->max_vcpus) {
> +            rc = 0;
> +            break;
> +        }Hmm -- there''s got to be a better way to find out the minimum number of
nodes to house a given number of vcpus than just starting at 1 and 
re-trying until we have enough.> +        /* Add one more node and retry fitting the domain */
> +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes, 1);
Same comment as above.>
> +    ret = place_domain(&d_config.b_info);
> +    if (ret == ESRCH || ret == ENOSPC) {
> +        fprintf(stderr, "failed to place the domain, fallback to all
nodes\n");
> +        libxl_nodemap_set_any(&d_config.b_info.nodemap);Hmm -- is this the right fallback?  I think in general, if the user asks 
for X to be done, and X cannot be done for whatever reason, the right 
thing is to tell the user that X cannot be done and let them sort it 
out, rather than resorting to Y (unless that''s been set).

Is it the case that if any node placement fails, then they''ll all fail?
Or to put it the other way, is it the case that if there is *some* 
placement, that any placement algorithm will eventually find it?  If so, 
then I think this fallback may make sense, as there''s nothing for the 
user to really do.> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
> --- a/xen/arch/x86/numa.c
> +++ b/xen/arch/x86/numa.c
> @@ -365,10 +365,11 @@ static void dump_numa(unsigned char key)
>
>          for_each_online_node(i) {
>                  paddr_t pa = (paddr_t)(NODE_DATA(i)->node_start_pfn +
1)<<  PAGE_SHIFT;
> -               printk("idx%d ->  NODE%d start->%lu
size->%lu\n",
> +               printk("idx%d ->  NODE%d start->%lu size->%lu
free->%lu\n",
>                            i, NODE_DATA(i)->node_id,
>                            NODE_DATA(i)->node_start_pfn,
> -                         NODE_DATA(i)->node_spanned_pages);
> +                         NODE_DATA(i)->node_spanned_pages,
> +                          avail_node_heap_pages(i));
>                  /* sanity check phys_to_nid() */
>                  printk("phys_to_nid(%"PRIpaddr") ->  %d
should be %d\n", pa, phys_to_nid(pa),
>                            NODE_DATA(i)->node_id);This should be in its own patch.

   -George

Ian Campbell

2012-May-03 08:10 UTC

head link

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

On Thu, 2012-05-03 at 02:03 +0100, Dario Faggioli wrote:> On Wed, 2012-05-02 at 18:30 +0200, Dario Faggioli wrote:
> > > > +
> > > > +        if (use_cpus>= b_info->max_vcpus) {
> > > > +            rc = 0;
> > > > +            break;
> > > > +        }
> > > Hmm -- there''s got to be a better way to find out the
minimum number
> > > of nodes to house a given number of vcpus than just starting at 1
> > > and re-trying until we have enough.
> >
> > ...
> >
> > However, if what you mean is I could check beforehand whether or not
> > the
> > user provided configuration will give us enough CPUs and avoid testing
> > scenarios that are guaranteed to fail, then I agree and I''ll
reshape
> > the
> > code to look like that. 
> >
> Actually, thinking more about this, I''m not even sure I can do
what I
> stated above. In fact, I don''t think I can assume the number of
CPUs
> each node is made up of to be known, without actually checking it for
> the specific node(s) I want to try using. What if some CPUs are off-line
> and stuff like that?
> 
> I might have to recheck, but reading topology information takes
> on/offline-ness into account, which is something I lose if I assume some
> static x number of CPUs in each node. Also, are we sure archs with
> different number of CPUs in different nodes won''t ever exist? :-P
> Finally, consider this is likely going to change when we will have some
> mechanism for taking the actual load on the CPUs into account, so maybe
> it''s not worth spending much time on it right now.
> 
> So, given what we currently export wrt topology, and all the above,
I''m
> not sure I can see any cleverer way of figuring our how many CPUs/nodes
> I need than explicitly counting them (for what it counts, xend is also
> doing the same AFAIUI :-) ).
If xend does it this way then this is IMHO also appropriate for (lib)xl
in 4.2 at this stage in the release. For 4.3 we can figure out the right
way to do it.

Ian.

George Dunlap

2012-May-03 10:16 UTC

head link

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

On 03/05/12 02:03, Dario Faggioli wrote:> Actually, thinking more about this, I''m not even sure I can do
what I
> stated above. In fact, I don''t think I can assume the number of
CPUs
> each node is made up of to be known, without actually checking it for
> the specific node(s) I want to try using. What if some CPUs are off-line
> and stuff like that?Hmm, yeah, off-line cpus are certainly going to be an issue.  If each 
node has an equal number of cores, we should at least be able to find a 
*minimum* number of nodes, and then increase it from there; then in the 
common case (where there are no offline cpus), we''ll only have a single
pass.

But I think the priority right now is getting something useable for the 
4.2 release; so I think getting to a reasonable functionality quickly 
(i.e., with a minimum amount of effort from you) makes sense.  If that 
means leaving it as is for the moment, that''s fine.  This
won''t be a hot
path, so a little bit of inefficiency should be acceptable.  We can 
always come back and revisit it later.

  -George

George Dunlap

2012-May-03 13:41 UTC

head link

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

On 02/05/12 17:30, Dario Faggioli wrote:>>> +
>>> +/* Store the policy for the domain while parsing */
>>> +static int nodes_policy = NODES_POLICY_DEFAULT;
>>> +
>>> +/* Store the number of nodes to be used while parsing */
>>> +static int num_nodes_policy = 0;
>> Why are "nodes_policy" and "num_nodes_policy" not
passed in along with
>> b_info?
>>
> That was my first implementation. Then I figured out that I want to do
> the placement in _xl_, not in _libxl_, so I really don''t need to
muck up
> build info with placement related stuff. Should I use b_info anyway,
> even if I don''t need these fields while in libxl?Ah right -- yeah, probably since b_info is a libxl structure, you 
shouldn''t add it in there.  But in that case you should probably add 
another xl-specific structure and pass it through, rather than having 
global variables, I think.  It''s only used in the handful of placement 
functions, right?> Sounds definitely nicer. I just did it like that because I found a very
> similar example in xl itself, but I''m open about changing this to
> whatever you and libxl maintainers reach a consensus on. :-)Right.  This is always a bit tricky, balancing your own taste for how to 
do things, and following the style of the code that you''re
modifying.>> Also, is it really necessary for a VM to have an equal amount of memory
>> on every node? It seems like it would be better to have 75% on one node
>> and 25% on a second node than to have 25% on four nodes, for example.
>> Furthermore, insisting on an even amount fragments the node memory
>> further -- i.e., if we chose to have 25% on four nodes instead of 75%
on
>> one and 25% on another, that will make it harder for another VM to fit
>> on a single node as well.
>>
> Ok, that is something quite important to discuss. What you propose makes
> a lot of sense, although some issues comes to my mind:
>
> - which percent should I try, and in what order? I mean, 75%/25%
>     sounds reasonable, but maybe also 80%/20% or even 60%/40% helps your
>     point.I had in mind no constraints at all on the ratios -- basically, if you 
can find N nodes such that the sum of free memory is enough to create 
the VM, even 99%/1%, then go for that rather than looking for N+1.  
Obviously finding a more balanced option would be better. One option 
would be to scan through finding all sets of N nodes that will satisfy 
the criteria, and then choose the most "balanced" one.  That might be 
more than we need for 4.2, so another option would be to look for evenly 
balanced nodes first, then if we don''t find a set, look for any set.  
(That certainly fits with the "first fit"
description!)> - suppose I go for 75%/25%, what about the scheduling oof the VM?Haha -- yeah, for a research paper, you''d probably implement some kind 
of lottery scheduling algorithm that would schedule it on one node 75% 
of the time and another node 25% of the time. :-)  But I think that just 
making the node affinity equal to both of them will be good enough for 
now.  There will be some variability in performance, but there will be 
some of that anyway depending on what node''s memory the guest happens
to
use more.

This actually kind of a different issue, but I''ll bring it up now 
because it''s related.  (Something to toss in for thinking about in 4.3 
really.)  Suppose there are 4 cores and 16GiB per node, and a VM has 8 
vcpus and 8GiB of RAM.  The algorithm you have here will attempt to put 
4GiB on each of two nodes (since it will take 2 nodes to get 8 cores).  
However, it''s pretty common for larger VMs to have way more vcpus than 
they actually use at any one time.  So it might actually have better 
performance to put all 8GiB on one node, and set the node affinity 
accordingly.  In the rare event that more than 4 vcpus are active, a 
handful of vcpus will have all remote accesses, but the majority of the 
time, all of the cpus will have local accesses.  (OTOH, maybe that 
should be only a policy thing that we should recommend in the 
documentation...)
> Please, don''t get me wrong, I see your point and really think it
makes
> sense. I''ve actually thought along the same line for a while, but
then I
> couldn''t find an answers to the questions above.
>
> That''s why, kind of falling back with Xen''s default
"striped" approach
> (although on as less nodes as possible, which is _much_ better than the
> actual Xen''s default!). It looked simple enough to write, read and
> understand, while still providing statistically consistent performances.Dude, this is open source.  Be opinionated. ;-)

What do you think of my suggestions above?>> Hmm -- if I''m reading this right, the only time the nodemap
won''t be
>> all nodes is if (1) the user specified nodes, or (2) there''s a
>> cpumask in effect. If we''re going to override that setting,
wouldn''t
>> it make sense to just expand to all numa nodes? 
> As you wish, the whole "what to do if what I''ve been provided
with
> doesn''t work" is in the *wild guess* status, meaning I tried
to figure
> out what would be best to do, but I might well be far from the actual
> correct solution, provided there is one.
>
> Trying to enlarge the nodemap step by step is potentially yielding
> better performances, but is probably not so near to the "least
surprise"
> principle one should use when designing UIs. :-(
>
>> Hmm -- though I suppose what you''d really want to try is
adding each
>> node in turn, rather than one at a time (i.e., if the cpus are pinned
to
>> nodes 2 and 3, and [2,3] doesn''t work, try [1,2,3] and [2,3,4]
before
>> trying [1,2,3,4].
>>
> Yep, that makes a real lot of sense, thanks! I can definitely try doing
> that, although it will complicate the code a bit...
>
>> But that''s starting to get really complicated -- I
>> wonder if it''s better to just fail and let the user change the
pinnings
>> / configuration node mapping.
>>
> Well, that will probably be the least surprising behaviour.
>
> Again, just let me know what you think it''s best among the various
> alternatives and I''ll go for it.I think if the user specifies a nodemap, and that nodemap doesn''t have 
enough memory, we should throw an error.

If there''s a node_affinity set, no memory on that node, but memory on a
*different* node, what will Xen do?  It will allocate memory on some 
other node, right?

So ATM even if you specify a cpumask, you''ll get memory on the masked 
nodes first, and then memory elsewhere (probably in a fairly random 
manner); but as much of the memory as possible will be on the masked 
nodes.  I wonder then if we shouldnt'' just keep that behavior -- i.e., 
if there''s a cpumask specified, just return the nodemask from that
mask,
and let Xen put as much as possible on that node and let the rest fall 
where it may.

What do you think?>>> +
>>> +        if (use_cpus>= b_info->max_vcpus) {
>>> +            rc = 0;
>>> +            break;
>>> +        }
>> Hmm -- there''s got to be a better way to find out the minimum
number of
>> nodes to house a given number of vcpus than just starting at 1 and
>> re-trying until we have enough.
>>> +        /* Add one more node and retry fitting the domain */
>>> +        __add_nodes_to_nodemap(&new_nodemap, numa, nr_nodes,
1);
>> Same comment as above.
>>
> I''m not sure I''m getting this. The whole point here is
let''s consider
> free memory on the various nodes first, and then adjust the result if
> some other constraints are being violated.Sorry, wrong above -- I meant the other comment about 
__add_nodes_to_nodemap(). :-)> However, if what you mean is I could check beforehand whether or not the
> user provided configuration will give us enough CPUs and avoid testing
> scenarios that are guaranteed to fail, then I agree and I''ll
reshape the
> code to look like that. This triggers the heuristics re-designing stuff
> from above again, as one have to decide what to do if user asks for
> "nodes=[1,3]" and I discover (earlier) that I need one more node
for
> having enough CPUs (I mean, what node should I try first?).No, that''s not exactly what I meant.  Suppose there are 4 cores per 
node, and a VM has 16 vcpus, and NUMA is just set to auto, with no other 
parameters.  If I''m reading your code right, what it will do is first 
try to find a set of 1 node that will satisfy the constraints, then 2 
nodes, then 3, nodes, then 4, &c.  Since there are at most 4 cores per 
node, we know that 1, 2, and 3 nodes are going to fail, regardless of 
how much memory there is or how many cpus are offline.  So why not just 
start with 4, if the user hasn''t specified anything?  Then if 4
doesn''t
work (either because there''s not enough memory, or some of the cpus are
offline), then we can start bumping it up to 5, 6, &c.

That''s what I was getting at -- but again, if it makes it too 
complicated, trading a bit of extra passes for a significant chunk of 
your debugging time is OK. :-)
> So, I''m not entirely sure I answered your question but the point
is your
> idea above is the best one: if you ask something and we don''t
manage in
> getting it done, just stop and let you figure things out.
> I''ve only one question about this approach, what if the automatic
> placement is/becomes the default? I mean, avoiding any kind of fallback
> (which again, makes sense to me in case the user is explicitly asking
> something specific) would mean a completely NUMA-unaware VM creation can
> be aborted even if the user did not say anything... How do we deal with
> this?Well, if the user didn''t specify anything, then we can''t
contradict
anything he specified, right? :-)  If the user doesn''t specify
anything,
and the default is "numa=auto", then I think we''re free to do
whatever
we think is best regarding NUMA placement; in fact, I think we should 
try to avoid failing VM creation if it''s at all possible.  I just meant
what I think we should do if the user asked for specific NUMA nodes or a 
specific number of nodes.  (I think that cpu masks should probably 
behave as it does now -- set the numa_affinity, but not fail domain 
creation if there''s not enough memory on those nodes.)

It seems like we have a number of issues here that would be good for 
more people to come in on -- what if I attempt to summarize the 
high-level decisions we''re talking about so that it''s easier
for more
people to comment on them?

  -George
>
>>> diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
>>> --- a/xen/arch/x86/numa.c
>>> +++ b/xen/arch/x86/numa.c
>>> ...
>>>
>> This should be in its own patch.
>>
> Ok.
>
> Thanks  lot again for taking a look!
>
> Regards,
> Dario
>

Xen devel - Apr 2012 - [PATCH 00 of 10 [RFC]] Automatically place guest on host's NUMA nodes with xl

[PATCH 00 of 10 [RFC]] Automatically place guest on host''s NUMA nodes with xl

[PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

[PATCH 02 of 10 [RFC]] libxl: Generalize libxl_cpumap to just libxl_map

[PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

[PATCH 04 of 10 [RFC]] libxl: Introduce libxl_get_numainfo() calling xc_numainfo()

[PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

[PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line

[PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`

[PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

[PATCH 09 of 10 [RFC]] xl: Introduce Best and Worst Fit guest placement algorithms

[PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

Re: [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

Re: [PATCH 01 of 10 [RFC]] libxc: Generalize xenctl_cpumap to just xenctl_map

Re: [PATCH 03 of 10 [RFC]] libxc, libxl: Introduce xc_nodemap_t and libxl_nodemap

Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

Re: [PATCH 06 of 10 [RFC]] xl: Allow user to set or change node affinity on-line

Re: [PATCH 10 of 10 [RFC]] xl: Some automatic NUMA placement documentation

Re: [PATCH 05 of 10 [RFC]] xl: Explicit node affinity specification for guests via config file

Formatting of emails which are comments on patches.

Re: Formatting of emails which are comments on patches.

Re: [PATCH 07 of 10 [RFC]] sched_credit: Let the scheduler know about `node affinity`

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes

Re: [PATCH 08 of 10 [RFC]] xl: Introduce First Fit memory-wise placement of guests on nodes