Hello Everyone,

Here it is: take 2 of the NUMA-aware credit scheduling series. Sorry it took a bit, but I had to take care of those nasty bugs causing scheduling anomalies, as they were getting in the way and messing up the numbers when trying to evaluate the performance of this! :-)

I also rewrote most of the core of the two-step vcpu and node affinity balancing algorithm, as per George's suggestion during the last round, to try to squeeze out a little more performance improvement.

As already and repeatedly said, what the series does is provide the (credit) scheduler with knowledge of a domain's node-affinity. It will then always try to run the domain's vCPUs on one of those nodes first. Only if that turns out to be impossible does it fall back to the old behaviour. (BTW, for any update on the status of my "quest" to improve NUMA support in Xen, see http://wiki.xen.org/wiki/Xen_NUMA_Roadmap.)

I re-ran my usual benchmark, SpecJBB2005, plus some others, i.e., some configurations of sysbench and lmbench. A little bit more about them:

 * SpecJBB is all about throughput, so pinning is likely the ideal solution.
 * Sysbench-memory measures the time it takes to write a fixed amount of memory (and it is then the throughput that is reported). We expect locality to be important, but at the same time the potential imbalances due to pinning could have a say in it.
 * Lmbench-proc measures the time it takes for a process to fork a fixed number of children. This is much more about latency than throughput, with locality of memory accesses playing a smaller role and, again, imbalances due to pinning being a potential issue.

On a 2-node, 16-core system, where I can have 2 to 10 VMs (2 vCPUs each) executing the benchmarks concurrently, here's what I get:

 ----------------------------------------------------
 | SpecJBB2005, throughput (the higher the better)  |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  43451.853  | 49876.750 |    49693.653    |
 |    6 |  29368.589  | 33782.132 |    33692.936    |
 |   10 |  19138.934  | 21950.696 |    21413.311    |
 ----------------------------------------------------
 | Sysbench memory, throughput (the higher the better)
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 |    2 |  484.42167  | 552.32667 |    552.86167    |
 |    6 |  404.43667  | 440.00056 |    449.42611    |
 |   10 |  296.45600  | 315.51733 |    331.49067    |
 ----------------------------------------------------
 | LMBench proc, latency (the lower the better)     |
 ----------------------------------------------------
 | #VMs | No affinity |  Pinning  | NUMA scheduling |
 ----------------------------------------------------
 |    2 |  824.00437  | 749.51892 |    741.42952    |
 |    6 |  942.39442  | 985.02761 |    974.94700    |
 |   10 |  1254.3121  | 1363.0792 |    1301.2917    |
 ----------------------------------------------------

In terms of % performance increase/decrease, that means NUMA-aware scheduling does as follows, compared to no affinity at all and to pinning:

 ----------------------------------
 | SpecJBB2005 (throughput)       |
 ----------------------------------
 | #VMs | No affinity |  Pinning  |
 |    2 |   +14.36%   |  -0.36%   |
 |    6 |   +14.72%   |  -0.26%   |
 |   10 |   +11.88%   |  -2.44%   |
 ----------------------------------
 | Sysbench memory (throughput)   |
 ----------------------------------
 | #VMs | No affinity |  Pinning  |
 |    2 |   +14.12%   |  +0.09%   |
 |    6 |   +11.12%   |  +2.14%   |
 |   10 |   +11.81%   |  +5.06%   |
 ----------------------------------
 | LMBench proc (latency)         |
 ----------------------------------
 | #VMs | No affinity |  Pinning  |
 ----------------------------------
 |    2 |   +10.02%   |  +1.07%   |
 |    6 |   +3.45%    |  +1.02%   |
 |   10 |   +2.94%    |  +4.53%   |
 ----------------------------------

The numbers seem to show that we are successfully taking advantage of both the improved locality (compared to no affinity) and the greater flexibility the NUMA-aware scheduling approach gives us (compared to pinning). In fact, where throughput alone is concerned (the SpecJBB case), it behaves almost on par with pinning, and a lot better than no affinity at all. Moreover, we are even able to do better than both of them when latency comes a little more into the game and the imbalances caused by pinning would make things worse than not having any affinity at all, as in the sysbench and, especially, the LMBench case.

Here are the patches included in the series. I '*'-ed the ones that already received one or more acks during v1. However, some patches were significantly reworked since then; in that case, I just ignored that and left them with my SOB only, as I think they definitely need to be re-reviewed. :-)

 * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
 * [ 2/10] xen, libxc: introduce node maps and masks
   [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
   [ 4/10] xen: allow for explicitly specifying node-affinity
 * [ 5/10] libxc: allow for explicitly specifying node-affinity
 * [ 6/10] libxl: allow for explicitly specifying node-affinity
   [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
 * [ 8/10] libxl: automatic placement deals with node-affinity
 * [ 9/10] xl: add node-affinity to the output of `xl list`
   [10/10] docs: rearrange and update NUMA placement documentation

Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
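As a rough illustration of the two-step idea described above, here is a minimal, standalone sketch. It is not the scheduler code itself: CPUs are just bits in a uint64_t, and the names (pick_cpu_two_step(), node_affine_cpus, hard_affinity, idlers) are made up for the example. The point is only the ordering: look for an idle CPU in the intersection of the vcpu's cpu-affinity and its node-affinity first, and fall back to the plain cpu-affinity only if that fails.

#include <stdint.h>
#include <stdio.h>

/* Illustrative only: CPUs are bits in a uint64_t, not real Xen cpumasks. */
static int first_idle_cpu(uint64_t candidates, uint64_t idlers)
{
    uint64_t m = candidates & idlers;
    return m ? __builtin_ctzll(m) : -1;
}

/* Step 1: node-affinity & cpu-affinity; step 2: cpu-affinity only. */
static int pick_cpu_two_step(uint64_t hard_affinity, uint64_t node_affine_cpus,
                             uint64_t idlers)
{
    int cpu = first_idle_cpu(hard_affinity & node_affine_cpus, idlers);
    if (cpu < 0)
        cpu = first_idle_cpu(hard_affinity, idlers);
    return cpu; /* -1 means: no idler found, use other heuristics */
}

int main(void)
{
    uint64_t hard = 0xff;    /* vcpu may run on CPUs 0-7     */
    uint64_t node = 0x0f;    /* it prefers node 0 (CPUs 0-3) */
    uint64_t idlers = 0xf0;  /* but only CPUs 4-7 are idle   */

    printf("picked CPU %d\n", pick_cpu_two_step(hard, node, idlers));
    return 0;
}

In this toy run the node-affine step finds nothing (the preferred CPUs 0-3 are all busy), so the fallback picks CPU 4, which is exactly the behaviour the series aims for: prefer the node, but never refuse to run.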
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
More specifically: 1. replaces xenctl_cpumap with xenctl_bitmap 2. provides bitmap_to_xenctl_bitmap and the reverse; 3. re-implement cpumask_to_xenctl_bitmap with bitmap_to_xenctl_bitmap and the reverse; Other than #3, no functional changes. Interface only slightly afected. This is in preparation of introducing NUMA node-affinity maps. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> diff --git a/tools/libxc/xc_cpupool.c b/tools/libxc/xc_cpupool.c --- a/tools/libxc/xc_cpupool.c +++ b/tools/libxc/xc_cpupool.c @@ -90,7 +90,7 @@ xc_cpupoolinfo_t *xc_cpupool_getinfo(xc_ sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_INFO; sysctl.u.cpupool_op.cpupool_id = poolid; set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local); - sysctl.u.cpupool_op.cpumap.nr_cpus = local_size * 8; + sysctl.u.cpupool_op.cpumap.nr_elems = local_size * 8; err = do_sysctl_save(xch, &sysctl); @@ -184,7 +184,7 @@ xc_cpumap_t xc_cpupool_freeinfo(xc_inter sysctl.cmd = XEN_SYSCTL_cpupool_op; sysctl.u.cpupool_op.op = XEN_SYSCTL_CPUPOOL_OP_FREEINFO; set_xen_guest_handle(sysctl.u.cpupool_op.cpumap.bitmap, local); - sysctl.u.cpupool_op.cpumap.nr_cpus = mapsize * 8; + sysctl.u.cpupool_op.cpumap.nr_elems = mapsize * 8; err = do_sysctl_save(xch, &sysctl); diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -142,7 +142,7 @@ int xc_vcpu_setaffinity(xc_interface *xc set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8; + domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8; ret = do_domctl(xch, &domctl); @@ -182,7 +182,7 @@ int xc_vcpu_getaffinity(xc_interface *xc domctl.u.vcpuaffinity.vcpu = vcpu; set_xen_guest_handle(domctl.u.vcpuaffinity.cpumap.bitmap, local); - domctl.u.vcpuaffinity.cpumap.nr_cpus = cpusize * 8; + domctl.u.vcpuaffinity.cpumap.nr_elems = cpusize * 8; ret = do_domctl(xch, &domctl); diff --git a/tools/libxc/xc_tbuf.c b/tools/libxc/xc_tbuf.c --- a/tools/libxc/xc_tbuf.c +++ b/tools/libxc/xc_tbuf.c @@ -134,7 +134,7 @@ int xc_tbuf_set_cpu_mask(xc_interface *x bitmap_64_to_byte(bytemap, &mask64, sizeof (mask64) * 8); set_xen_guest_handle(sysctl.u.tbuf_op.cpu_mask.bitmap, bytemap); - sysctl.u.tbuf_op.cpu_mask.nr_cpus = sizeof(bytemap) * 8; + sysctl.u.tbuf_op.cpu_mask.nr_elems = sizeof(bytemap) * 8; ret = do_sysctl(xch, &sysctl); diff --git a/xen/arch/x86/cpu/mcheck/mce.c b/xen/arch/x86/cpu/mcheck/mce.c --- a/xen/arch/x86/cpu/mcheck/mce.c +++ b/xen/arch/x86/cpu/mcheck/mce.c @@ -1457,8 +1457,7 @@ long do_mca(XEN_GUEST_HANDLE_PARAM(xen_m cpumap = &cpu_online_map; else { - ret = xenctl_cpumap_to_cpumask(&cmv, - &op->u.mc_inject_v2.cpumap); + ret = xenctl_bitmap_to_cpumask(&cmv, &op->u.mc_inject_v2.cpumap); if ( ret ) break; cpumap = cmv; diff --git a/xen/arch/x86/platform_hypercall.c b/xen/arch/x86/platform_hypercall.c --- a/xen/arch/x86/platform_hypercall.c +++ b/xen/arch/x86/platform_hypercall.c @@ -375,7 +375,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA { uint32_t cpu; uint64_t idletime, now = NOW(); - struct xenctl_cpumap ctlmap; + struct xenctl_bitmap ctlmap; cpumask_var_t cpumap; XEN_GUEST_HANDLE(uint8) cpumap_bitmap; XEN_GUEST_HANDLE(uint64) idletimes; @@ -388,11 +388,11 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA if ( cpufreq_controller != FREQCTL_dom0_kernel ) break; - ctlmap.nr_cpus = op->u.getidletime.cpumap_nr_cpus; + ctlmap.nr_elems = 
op->u.getidletime.cpumap_nr_cpus; guest_from_compat_handle(cpumap_bitmap, op->u.getidletime.cpumap_bitmap); ctlmap.bitmap.p = cpumap_bitmap.p; /* handle -> handle_64 conversion */ - if ( (ret = xenctl_cpumap_to_cpumask(&cpumap, &ctlmap)) != 0 ) + if ( (ret = xenctl_bitmap_to_cpumask(&cpumap, &ctlmap)) != 0 ) goto out; guest_from_compat_handle(idletimes, op->u.getidletime.idletime); @@ -411,7 +411,7 @@ ret_t do_platform_op(XEN_GUEST_HANDLE_PA op->u.getidletime.now = now; if ( ret == 0 ) - ret = cpumask_to_xenctl_cpumap(&ctlmap, cpumap); + ret = cpumask_to_xenctl_bitmap(&ctlmap, cpumap); free_cpumask_var(cpumap); if ( ret == 0 && __copy_field_to_guest(u_xenpf_op, op, u.getidletime) ) diff --git a/xen/common/cpupool.c b/xen/common/cpupool.c --- a/xen/common/cpupool.c +++ b/xen/common/cpupool.c @@ -493,7 +493,7 @@ int cpupool_do_sysctl(struct xen_sysctl_ op->cpupool_id = c->cpupool_id; op->sched_id = c->sched->sched_id; op->n_dom = c->n_dom; - ret = cpumask_to_xenctl_cpumap(&op->cpumap, c->cpu_valid); + ret = cpumask_to_xenctl_bitmap(&op->cpumap, c->cpu_valid); cpupool_put(c); } break; @@ -588,7 +588,7 @@ int cpupool_do_sysctl(struct xen_sysctl_ case XEN_SYSCTL_CPUPOOL_OP_FREEINFO: { - ret = cpumask_to_xenctl_cpumap( + ret = cpumask_to_xenctl_bitmap( &op->cpumap, &cpupool_free_cpus); } break; diff --git a/xen/common/domctl.c b/xen/common/domctl.c --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -32,28 +32,29 @@ static DEFINE_SPINLOCK(domctl_lock); DEFINE_SPINLOCK(vcpu_alloc_lock); -int cpumask_to_xenctl_cpumap( - struct xenctl_cpumap *xenctl_cpumap, const cpumask_t *cpumask) +int bitmap_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_bitmap, + const unsigned long *bitmap, + unsigned int nbits) { unsigned int guest_bytes, copy_bytes, i; uint8_t zero = 0; int err = 0; - uint8_t *bytemap = xmalloc_array(uint8_t, (nr_cpu_ids + 7) / 8); + uint8_t *bytemap = xmalloc_array(uint8_t, (nbits + 7) / 8); if ( !bytemap ) return -ENOMEM; - guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8; - copy_bytes = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8); + guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8; + copy_bytes = min_t(unsigned int, guest_bytes, (nbits + 7) / 8); - bitmap_long_to_byte(bytemap, cpumask_bits(cpumask), nr_cpu_ids); + bitmap_long_to_byte(bytemap, bitmap, nbits); if ( copy_bytes != 0 ) - if ( copy_to_guest(xenctl_cpumap->bitmap, bytemap, copy_bytes) ) + if ( copy_to_guest(xenctl_bitmap->bitmap, bytemap, copy_bytes) ) err = -EFAULT; for ( i = copy_bytes; !err && i < guest_bytes; i++ ) - if ( copy_to_guest_offset(xenctl_cpumap->bitmap, i, &zero, 1) ) + if ( copy_to_guest_offset(xenctl_bitmap->bitmap, i, &zero, 1) ) err = -EFAULT; xfree(bytemap); @@ -61,36 +62,58 @@ int cpumask_to_xenctl_cpumap( return err; } -int xenctl_cpumap_to_cpumask( - cpumask_var_t *cpumask, const struct xenctl_cpumap *xenctl_cpumap) +int xenctl_bitmap_to_bitmap(unsigned long *bitmap, + const struct xenctl_bitmap *xenctl_bitmap, + unsigned int nbits) { unsigned int guest_bytes, copy_bytes; int err = 0; - uint8_t *bytemap = xzalloc_array(uint8_t, (nr_cpu_ids + 7) / 8); + uint8_t *bytemap = xzalloc_array(uint8_t, (nbits + 7) / 8); if ( !bytemap ) return -ENOMEM; - guest_bytes = (xenctl_cpumap->nr_cpus + 7) / 8; - copy_bytes = min_t(unsigned int, guest_bytes, (nr_cpu_ids + 7) / 8); + guest_bytes = (xenctl_bitmap->nr_elems + 7) / 8; + copy_bytes = min_t(unsigned int, guest_bytes, (nbits + 7) / 8); if ( copy_bytes != 0 ) { - if ( copy_from_guest(bytemap, xenctl_cpumap->bitmap, copy_bytes) ) + if ( 
copy_from_guest(bytemap, xenctl_bitmap->bitmap, copy_bytes) ) err = -EFAULT; - if ( (xenctl_cpumap->nr_cpus & 7) && (guest_bytes == copy_bytes) ) - bytemap[guest_bytes-1] &= ~(0xff << (xenctl_cpumap->nr_cpus & 7)); + if ( (xenctl_bitmap->nr_elems & 7) && (guest_bytes == copy_bytes) ) + bytemap[guest_bytes-1] &= ~(0xff << (xenctl_bitmap->nr_elems & 7)); } - if ( err ) - /* nothing */; - else if ( alloc_cpumask_var(cpumask) ) - bitmap_byte_to_long(cpumask_bits(*cpumask), bytemap, nr_cpu_ids); + if ( !err ) + bitmap_byte_to_long(bitmap, bytemap, nbits); + + xfree(bytemap); + + return err; +} + +int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_cpumap, + const cpumask_t *cpumask) +{ + return bitmap_to_xenctl_bitmap(xenctl_cpumap, cpumask_bits(cpumask), + nr_cpu_ids); +} + +int xenctl_bitmap_to_cpumask(cpumask_var_t *cpumask, + const struct xenctl_bitmap *xenctl_cpumap) +{ + int err = 0; + + if ( alloc_cpumask_var(cpumask) ) { + err = xenctl_bitmap_to_bitmap(cpumask_bits(*cpumask), xenctl_cpumap, + nr_cpu_ids); + /* In case of error, cleanup is up to us, as the caller won''t care! */ + if ( err ) + free_cpumask_var(*cpumask); + } else err = -ENOMEM; - xfree(bytemap); - return err; } @@ -583,7 +606,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe { cpumask_var_t new_affinity; - ret = xenctl_cpumap_to_cpumask( + ret = xenctl_bitmap_to_cpumask( &new_affinity, &op->u.vcpuaffinity.cpumap); if ( !ret ) { @@ -593,7 +616,7 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe } else { - ret = cpumask_to_xenctl_cpumap( + ret = cpumask_to_xenctl_bitmap( &op->u.vcpuaffinity.cpumap, v->cpu_affinity); } } diff --git a/xen/common/trace.c b/xen/common/trace.c --- a/xen/common/trace.c +++ b/xen/common/trace.c @@ -384,7 +384,7 @@ int tb_control(xen_sysctl_tbuf_op_t *tbc { cpumask_var_t mask; - rc = xenctl_cpumap_to_cpumask(&mask, &tbc->cpu_mask); + rc = xenctl_bitmap_to_cpumask(&mask, &tbc->cpu_mask); if ( !rc ) { cpumask_copy(&tb_cpu_mask, mask); diff --git a/xen/include/public/arch-x86/xen-mca.h b/xen/include/public/arch-x86/xen-mca.h --- a/xen/include/public/arch-x86/xen-mca.h +++ b/xen/include/public/arch-x86/xen-mca.h @@ -414,7 +414,7 @@ struct xen_mc_mceinject { struct xen_mc_inject_v2 { uint32_t flags; - struct xenctl_cpumap cpumap; + struct xenctl_bitmap cpumap; }; #endif diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -284,7 +284,7 @@ DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvc /* XEN_DOMCTL_getvcpuaffinity */ struct xen_domctl_vcpuaffinity { uint32_t vcpu; /* IN */ - struct xenctl_cpumap cpumap; /* IN/OUT */ + struct xenctl_bitmap cpumap; /* IN/OUT */ }; typedef struct xen_domctl_vcpuaffinity xen_domctl_vcpuaffinity_t; DEFINE_XEN_GUEST_HANDLE(xen_domctl_vcpuaffinity_t); diff --git a/xen/include/public/sysctl.h b/xen/include/public/sysctl.h --- a/xen/include/public/sysctl.h +++ b/xen/include/public/sysctl.h @@ -71,7 +71,7 @@ struct xen_sysctl_tbuf_op { #define XEN_SYSCTL_TBUFOP_disable 5 uint32_t cmd; /* IN/OUT variables */ - struct xenctl_cpumap cpu_mask; + struct xenctl_bitmap cpu_mask; uint32_t evt_mask; /* OUT variables */ uint64_aligned_t buffer_mfn; @@ -532,7 +532,7 @@ struct xen_sysctl_cpupool_op { uint32_t domid; /* IN: M */ uint32_t cpu; /* IN: AR */ uint32_t n_dom; /* OUT: I */ - struct xenctl_cpumap cpumap; /* OUT: IF */ + struct xenctl_bitmap cpumap; /* OUT: IF */ }; typedef struct xen_sysctl_cpupool_op xen_sysctl_cpupool_op_t; DEFINE_XEN_GUEST_HANDLE(xen_sysctl_cpupool_op_t); diff --git 
a/xen/include/public/xen.h b/xen/include/public/xen.h --- a/xen/include/public/xen.h +++ b/xen/include/public/xen.h @@ -851,9 +851,9 @@ typedef uint8_t xen_domain_handle_t[16]; #endif #ifndef __ASSEMBLY__ -struct xenctl_cpumap { +struct xenctl_bitmap { XEN_GUEST_HANDLE_64(uint8) bitmap; - uint32_t nr_cpus; + uint32_t nr_elems; }; #endif diff --git a/xen/include/xen/cpumask.h b/xen/include/xen/cpumask.h --- a/xen/include/xen/cpumask.h +++ b/xen/include/xen/cpumask.h @@ -424,8 +424,8 @@ extern cpumask_t cpu_present_map; #define for_each_present_cpu(cpu) for_each_cpu(cpu, &cpu_present_map) /* Copy to/from cpumap provided by control tools. */ -struct xenctl_cpumap; -int cpumask_to_xenctl_cpumap(struct xenctl_cpumap *, const cpumask_t *); -int xenctl_cpumap_to_cpumask(cpumask_var_t *, const struct xenctl_cpumap *); +struct xenctl_bitmap; +int cpumask_to_xenctl_bitmap(struct xenctl_bitmap *, const cpumask_t *); +int xenctl_bitmap_to_cpumask(cpumask_var_t *, const struct xenctl_bitmap *); #endif /* __XEN_CPUMASK_H */ diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst --- a/xen/include/xlat.lst +++ b/xen/include/xlat.lst @@ -2,7 +2,7 @@ # ! - needs translation # ? - needs checking ? dom0_vga_console_info xen.h -? xenctl_cpumap xen.h +? xenctl_bitmap xen.h ? mmu_update xen.h ! mmuext_op xen.h ! start_info xen.h
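One detail of the generalised conversion helpers above that is easy to miss is the byte/bit accounting: nr_elems bits travel between Xen and the guest as (nr_elems + 7) / 8 whole bytes, so any stray bits beyond nr_elems in the last byte have to be cleared, which is what the bytemap[guest_bytes-1] &= ~(0xff << (nr_elems & 7)) line does. Below is a standalone restatement of just that arithmetic (plain arrays instead of guest handles and xmalloc; sanitize_bytemap() is a made-up name, not part of the patch):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Standalone illustration of the byte/bit accounting used by the
 * xenctl_bitmap helpers: nr_elems bits are exchanged as whole bytes,
 * and any bits beyond nr_elems in the last byte must be cleared. */
static void sanitize_bytemap(uint8_t *bytemap, unsigned int nr_elems)
{
    unsigned int bytes = (nr_elems + 7) / 8;

    if (nr_elems & 7)
        bytemap[bytes - 1] &= ~(0xff << (nr_elems & 7));
}

int main(void)
{
    uint8_t map[2];
    unsigned int nr_elems = 10;   /* e.g. a 10-node bitmap */

    memset(map, 0xff, sizeof(map));
    sanitize_bytemap(map, nr_elems);

    /* Only bits 0-9 may survive: 0xff in byte 0, 0x03 in byte 1. */
    printf("%02x %02x\n", map[0], map[1]);
    return 0;
}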
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
Following suit from cpumap and cpumask implementations. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> diff --git a/tools/libxc/xc_misc.c b/tools/libxc/xc_misc.c --- a/tools/libxc/xc_misc.c +++ b/tools/libxc/xc_misc.c @@ -54,6 +54,11 @@ int xc_get_cpumap_size(xc_interface *xch return (xc_get_max_cpus(xch) + 7) / 8; } +int xc_get_nodemap_size(xc_interface *xch) +{ + return (xc_get_max_nodes(xch) + 7) / 8; +} + xc_cpumap_t xc_cpumap_alloc(xc_interface *xch) { int sz; @@ -64,6 +69,16 @@ xc_cpumap_t xc_cpumap_alloc(xc_interface return calloc(1, sz); } +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch) +{ + int sz; + + sz = xc_get_nodemap_size(xch); + if (sz == 0) + return NULL; + return calloc(1, sz); +} + int xc_readconsolering(xc_interface *xch, char *buffer, unsigned int *pnr_chars, diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -330,12 +330,20 @@ int xc_get_cpumap_size(xc_interface *xch /* allocate a cpumap */ xc_cpumap_t xc_cpumap_alloc(xc_interface *xch); - /* +/* * NODEMAP handling */ +typedef uint8_t *xc_nodemap_t; + /* return maximum number of NUMA nodes the hypervisor supports */ int xc_get_max_nodes(xc_interface *xch); +/* return array size for nodemap */ +int xc_get_nodemap_size(xc_interface *xch); + +/* allocate a nodemap */ +xc_nodemap_t xc_nodemap_alloc(xc_interface *xch); + /* * DOMAIN DEBUGGING FUNCTIONS */ diff --git a/xen/common/domctl.c b/xen/common/domctl.c --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -117,6 +117,30 @@ int xenctl_bitmap_to_cpumask(cpumask_var return err; } +int nodemask_to_xenctl_bitmap(struct xenctl_bitmap *xenctl_nodemap, + const nodemask_t *nodemask) +{ + return bitmap_to_xenctl_bitmap(xenctl_nodemap, cpumask_bits(nodemask), + MAX_NUMNODES); +} + +int xenctl_bitmap_to_nodemask(nodemask_t *nodemask, + const struct xenctl_bitmap *xenctl_nodemap) +{ + int err = 0; + + if ( alloc_nodemask_var(nodemask) ) { + err = xenctl_bitmap_to_bitmap(nodes_addr(*nodemask), xenctl_nodemap, + MAX_NUMNODES); + if ( err ) + free_nodemask_var(*nodemask); + } + else + err = -ENOMEM; + + return err; +} + static inline int is_free_domid(domid_t dom) { struct domain *d; diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h --- a/xen/include/xen/nodemask.h +++ b/xen/include/xen/nodemask.h @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const } #endif +/* + * nodemask_var_t: struct nodemask for stack usage. + * + * See definition of cpumask_var_t in include/xen//cpumask.h. 
+ */ +#if MAX_NUMNODES > 2 * BITS_PER_LONG +#include <xen/xmalloc.h> + +typedef nodemask_t *nodemask_var_t; + +#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG) + +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) +{ + *(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long)); + return *mask != NULL; +} + +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) +{ + *(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long)); + return *mask != NULL; +} + +static inline void free_nodemask_var(nodemask_var_t mask) +{ + xfree(mask); +} +#else +typedef nodemask_t nodemask_var_t; + +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) +{ + return 1; +} + +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) +{ + nodes_clear(*mask); + return 1; +} + +static inline void free_nodemask_var(nodemask_var_t mask) +{ +} +#endif + #if MAX_NUMNODES > 1 #define for_each_node_mask(node, mask) \ for ((node) = first_node(mask); \
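For completeness, here is a hedged sketch of how a libxc client might use the new nodemap helpers introduced above. It assumes the usual xc_interface_open(NULL, NULL, 0) / xc_interface_close() entry points and keeps error handling minimal; since xc_nodemap_alloc() is just a calloc() of xc_get_nodemap_size() bytes, plain free() is used to release it.

#include <stdio.h>
#include <stdlib.h>
#include <xenctrl.h>

int main(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    xc_nodemap_t nodemap;
    int node = 0;

    if (!xch)
        return 1;

    nodemap = xc_nodemap_alloc(xch);   /* sized from xc_get_max_nodes() */
    if (!nodemap) {
        xc_interface_close(xch);
        return 1;
    }

    /* xc_nodemap_t is just a byte array: set node 0 by hand. */
    nodemap[node / 8] |= 1 << (node % 8);

    printf("nodemap is %d byte(s) long\n", xc_get_nodemap_size(xch));

    free(nodemap);                     /* allocated with calloc() above */
    xc_interface_close(xch);
    return 0;
}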
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
As vcpu-affinity tells where VCPUs must run, node-affinity tells where they should or, better, prefer. While respecting vcpu-affinity remains mandatory, node-affinity is not that strict, it only expresses a preference, although honouring it is almost always true that will bring significant performances benefit (especially as compared to not having any affinity at all). This change modifies the VCPU load balancing algorithm (for the credit scheduler only), introducing a two steps logic. During the first step, we use the node-affinity mask. The aim is giving precedence to the CPUs where it is known to be preferable for the domain to run. If that fails in finding a valid PCPU, the node-affinity is just ignored and, in the second step, we fall back to using cpu-affinity only. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- Changes from v1: * CPU masks variables moved off from the stack, as requested during review. As per the comments in the code, having them in the private (per-scheduler instance) struct could have been enough, but it would be racy (again, see comments). For that reason, use a global bunch of them of (via per_cpu()); * George suggested a different load balancing logic during v1''s review. I think he was right and then I changed the old implementation in a way that resembles exactly that. I rewrote most of this patch to introduce a more sensible and effective noda-affinity handling logic. diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -111,6 +111,33 @@ /* + * Node Balancing + */ +#define CSCHED_BALANCE_CPU_AFFINITY 0 +#define CSCHED_BALANCE_NODE_AFFINITY 1 +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY + +/* + * When building for high number of CPUs, cpumask_var_t + * variables on stack are better avoided. However, we need them, + * in order to be able to consider both vcpu and node affinity. + * We also don''t want to xmalloc()/xfree() them, as that would + * happen in critical code paths. Therefore, let''s (pre)allocate + * some scratch space for them. + * + * Having one mask for each instance of the scheduler seems + * enough, and that would suggest putting it wihin `struct + * csched_private'' below. However, we don''t always hold the + * private scheduler lock when the mask itself would need to + * be used, leaving room for races. For that reason, we define + * and use a cpumask_t for each CPU. As preemption is not an + * issue here (we''re holding the runqueue spin-lock!), that is + * both enough and safe. + */ +DEFINE_PER_CPU(cpumask_t, csched_balance_mask); +#define scratch_balance_mask (this_cpu(csched_balance_mask)) + +/* * Boot parameters */ static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS; @@ -159,6 +186,9 @@ struct csched_dom { struct list_head active_vcpu; struct list_head active_sdom_elem; struct domain *dom; + /* cpumask translated from the domain''s node-affinity. + * Basically, the CPUs we prefer to be scheduled on. */ + cpumask_var_t node_affinity_cpumask; uint16_t active_vcpu_count; uint16_t weight; uint16_t cap; @@ -239,6 +269,42 @@ static inline void list_del_init(&svc->runq_elem); } +#define for_each_csched_balance_step(__step) \ + for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) + +/* + * Each csched-balance step has to use its own cpumask. This function + * determines which one, given the step, and copies it in mask. 
Notice + * that, in case of node-affinity balancing step, it also filters out from + * the node-affinity mask the cpus that are not part of vc''s cpu-affinity, + * as we do not want to end up running a vcpu where it would like, but + * is not allowed to! + * + * As an optimization, if a domain does not have any node-affinity at all + * (namely, its node affinity is automatically computed), not only the + * computed mask will reflect its vcpu-affinity, but we also return -1 to + * let the caller know that he can skip the step or quit the loop (if he + * wants). + */ +static int +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) +{ + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) + { + struct domain *d = vc->domain; + struct csched_dom *sdom = CSCHED_DOM(d); + + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); + + if ( cpumask_full(sdom->node_affinity_cpumask) ) + return -1; + } + else /* step == CSCHED_BALANCE_CPU_AFFINITY */ + cpumask_copy(mask, vc->cpu_affinity); + + return 0; +} + static void burn_credits(struct csched_vcpu *svc, s_time_t now) { s_time_t delta; @@ -266,67 +332,94 @@ static inline void struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu)); struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); cpumask_t mask, idle_mask; - int idlers_empty; + int balance_step, idlers_empty; ASSERT(cur); - cpumask_clear(&mask); - idlers_empty = cpumask_empty(prv->idlers); /* - * If the pcpu is idle, or there are no idlers and the new - * vcpu is a higher priority than the old vcpu, run it here. - * - * If there are idle cpus, first try to find one suitable to run - * new, so we can avoid preempting cur. If we cannot find a - * suitable idler on which to run new, run it here, but try to - * find a suitable idler on which to run cur instead. + * Node and vcpu-affinity balancing loop. To speed things up, in case + * no node-affinity at all is present, scratch_balance_mask reflects + * the vcpu-affinity, and ret is -1, so that we then can quit the + * loop after only one step. */ - if ( cur->pri == CSCHED_PRI_IDLE - || (idlers_empty && new->pri > cur->pri) ) + for_each_csched_balance_step( balance_step ) { - if ( cur->pri != CSCHED_PRI_IDLE ) - SCHED_STAT_CRANK(tickle_idlers_none); - cpumask_set_cpu(cpu, &mask); - } - else if ( !idlers_empty ) - { - /* Check whether or not there are idlers that can run new */ - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); + int ret, new_idlers_empty; + + cpumask_clear(&mask); /* - * If there are no suitable idlers for new, and it''s higher - * priority than cur, ask the scheduler to migrate cur away. - * We have to act like this (instead of just waking some of - * the idlers suitable for cur) because cur is running. + * If the pcpu is idle, or there are no idlers and the new + * vcpu is a higher priority than the old vcpu, run it here. * - * If there are suitable idlers for new, no matter priorities, - * leave cur alone (as it is running and is, likely, cache-hot) - * and wake some of them (which is waking up and so is, likely, - * cache cold anyway). + * If there are idle cpus, first try to find one suitable to run + * new, so we can avoid preempting cur. If we cannot find a + * suitable idler on which to run new, run it here, but try to + * find a suitable idler on which to run cur instead. 
*/ - if ( cpumask_empty(&idle_mask) && new->pri > cur->pri ) + if ( cur->pri == CSCHED_PRI_IDLE + || (idlers_empty && new->pri > cur->pri) ) { - SCHED_STAT_CRANK(tickle_idlers_none); - SCHED_VCPU_STAT_CRANK(cur, kicked_away); - SCHED_VCPU_STAT_CRANK(cur, migrate_r); - SCHED_STAT_CRANK(migrate_kicked_away); - set_bit(_VPF_migrating, &cur->vcpu->pause_flags); + if ( cur->pri != CSCHED_PRI_IDLE ) + SCHED_STAT_CRANK(tickle_idlers_none); cpumask_set_cpu(cpu, &mask); } - else if ( !cpumask_empty(&idle_mask) ) + else if ( !idlers_empty ) { - /* Which of the idlers suitable for new shall we wake up? */ - SCHED_STAT_CRANK(tickle_idlers_some); - if ( opt_tickle_one_idle ) + /* Are there idlers suitable for new (for this balance step)? */ + ret = csched_balance_cpumask(new->vcpu, balance_step, + &scratch_balance_mask); + cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask); + new_idlers_empty = cpumask_empty(&idle_mask); + + /* + * Let''s not be too harsh! If there aren''t idlers suitable + * for new in its node-affinity mask, make sure we check its + * vcpu-affinity as well, before tacking final decisions. + */ + if ( new_idlers_empty + && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) ) + continue; + + /* + * If there are no suitable idlers for new, and it''s higher + * priority than cur, ask the scheduler to migrate cur away. + * We have to act like this (instead of just waking some of + * the idlers suitable for cur) because cur is running. + * + * If there are suitable idlers for new, no matter priorities, + * leave cur alone (as it is running and is, likely, cache-hot) + * and wake some of them (which is waking up and so is, likely, + * cache cold anyway). + */ + if ( new_idlers_empty && new->pri > cur->pri ) { - this_cpu(last_tickle_cpu) - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); + SCHED_STAT_CRANK(tickle_idlers_none); + SCHED_VCPU_STAT_CRANK(cur, kicked_away); + SCHED_VCPU_STAT_CRANK(cur, migrate_r); + SCHED_STAT_CRANK(migrate_kicked_away); + set_bit(_VPF_migrating, &cur->vcpu->pause_flags); + cpumask_set_cpu(cpu, &mask); } - else - cpumask_or(&mask, &mask, &idle_mask); + else if ( !new_idlers_empty ) + { + /* Which of the idlers suitable for new shall we wake up? */ + SCHED_STAT_CRANK(tickle_idlers_some); + if ( opt_tickle_one_idle ) + { + this_cpu(last_tickle_cpu) + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); + } + else + cpumask_or(&mask, &mask, &idle_mask); + } } + + /* Did we find anyone (or csched_balance_cpumask() says we''re done)? */ + if ( !cpumask_empty(&mask) || ret ) + break; } if ( !cpumask_empty(&mask) ) @@ -475,15 +568,28 @@ static inline int } static inline int -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu) +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) { /* * Don''t pick up work that''s in the peer''s scheduling tail or hot on - * peer PCPU. Only pick up work that''s allowed to run on our CPU. + * peer PCPU. Only pick up work that prefers and/or is allowed to run + * on our CPU. */ return !vc->is_running && !__csched_vcpu_is_cache_hot(vc) && - cpumask_test_cpu(dest_cpu, vc->cpu_affinity); + cpumask_test_cpu(dest_cpu, mask); +} + +static inline int +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) +{ + /* + * Consent to migration if cpu is one of the idlers in the VCPU''s + * affinity mask. 
In fact, if that is not the case, it just means it + * was some other CPU that was tickled and should hence come and pick + * VCPU up. Migrating it to cpu would only make things worse. + */ + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask); } static int @@ -493,85 +599,98 @@ static int cpumask_t idlers; cpumask_t *online; struct csched_pcpu *spc = NULL; + int ret, balance_step; int cpu; - /* - * Pick from online CPUs in VCPU''s affinity mask, giving a - * preference to its current processor if it''s in there. - */ online = cpupool_scheduler_cpumask(vc->domain->cpupool); - cpumask_and(&cpus, online, vc->cpu_affinity); - cpu = cpumask_test_cpu(vc->processor, &cpus) - ? vc->processor - : cpumask_cycle(vc->processor, &cpus); - ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); + for_each_csched_balance_step( balance_step ) + { + /* Pick an online CPU from the proper affinity mask */ + ret = csched_balance_cpumask(vc, balance_step, &cpus); + cpumask_and(&cpus, &cpus, online); - /* - * Try to find an idle processor within the above constraints. - * - * In multi-core and multi-threaded CPUs, not all idle execution - * vehicles are equal! - * - * We give preference to the idle execution vehicle with the most - * idling neighbours in its grouping. This distributes work across - * distinct cores first and guarantees we don''t do something stupid - * like run two VCPUs on co-hyperthreads while there are idle cores - * or sockets. - * - * Notice that, when computing the "idleness" of cpu, we may want to - * discount vc. That is, iff vc is the currently running and the only - * runnable vcpu on cpu, we add cpu to the idlers. - */ - cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); - if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) - cpumask_set_cpu(cpu, &idlers); - cpumask_and(&cpus, &cpus, &idlers); - cpumask_clear_cpu(cpu, &cpus); + /* If present, prefer vc''s current processor */ + cpu = cpumask_test_cpu(vc->processor, &cpus) + ? vc->processor + : cpumask_cycle(vc->processor, &cpus); + ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); - while ( !cpumask_empty(&cpus) ) - { - cpumask_t cpu_idlers; - cpumask_t nxt_idlers; - int nxt, weight_cpu, weight_nxt; - int migrate_factor; + /* + * Try to find an idle processor within the above constraints. + * + * In multi-core and multi-threaded CPUs, not all idle execution + * vehicles are equal! + * + * We give preference to the idle execution vehicle with the most + * idling neighbours in its grouping. This distributes work across + * distinct cores first and guarantees we don''t do something stupid + * like run two VCPUs on co-hyperthreads while there are idle cores + * or sockets. + * + * Notice that, when computing the "idleness" of cpu, we may want to + * discount vc. That is, iff vc is the currently running and the only + * runnable vcpu on cpu, we add cpu to the idlers. 
+ */ + cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); + if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) + cpumask_set_cpu(cpu, &idlers); + cpumask_and(&cpus, &cpus, &idlers); + /* If there are idlers and cpu is still not among them, pick one */ + if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) ) + cpu = cpumask_cycle(cpu, &cpus); + cpumask_clear_cpu(cpu, &cpus); - nxt = cpumask_cycle(cpu, &cpus); + while ( !cpumask_empty(&cpus) ) + { + cpumask_t cpu_idlers; + cpumask_t nxt_idlers; + int nxt, weight_cpu, weight_nxt; + int migrate_factor; - if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) - { - /* We''re on the same socket, so check the busy-ness of threads. - * Migrate if # of idlers is less at all */ - ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); - migrate_factor = 1; - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu)); - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt)); - } - else - { - /* We''re on different sockets, so check the busy-ness of cores. - * Migrate only if the other core is twice as idle */ - ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); - migrate_factor = 2; - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu)); - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); + nxt = cpumask_cycle(cpu, &cpus); + + if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) + { + /* We''re on the same socket, so check the busy-ness of threads. + * Migrate if # of idlers is less at all */ + ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); + migrate_factor = 1; + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, + cpu)); + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, + nxt)); + } + else + { + /* We''re on different sockets, so check the busy-ness of cores. + * Migrate only if the other core is twice as idle */ + ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); + migrate_factor = 2; + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu)); + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); + } + + weight_cpu = cpumask_weight(&cpu_idlers); + weight_nxt = cpumask_weight(&nxt_idlers); + /* smt_power_savings: consolidate work rather than spreading it */ + if ( sched_smt_power_savings ? + weight_cpu > weight_nxt : + weight_cpu * migrate_factor < weight_nxt ) + { + cpumask_and(&nxt_idlers, &cpus, &nxt_idlers); + spc = CSCHED_PCPU(nxt); + cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers); + cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu)); + } + else + { + cpumask_andnot(&cpus, &cpus, &nxt_idlers); + } } - weight_cpu = cpumask_weight(&cpu_idlers); - weight_nxt = cpumask_weight(&nxt_idlers); - /* smt_power_savings: consolidate work rather than spreading it */ - if ( sched_smt_power_savings ? 
- weight_cpu > weight_nxt : - weight_cpu * migrate_factor < weight_nxt ) - { - cpumask_and(&nxt_idlers, &cpus, &nxt_idlers); - spc = CSCHED_PCPU(nxt); - cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers); - cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu)); - } - else - { - cpumask_andnot(&cpus, &cpus, &nxt_idlers); - } + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */ + if ( cpumask_test_cpu(cpu, &idlers) || ret ) + break; } if ( commit && spc ) @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu if ( sdom == NULL ) return NULL; + if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) + { + xfree(sdom); + return NULL; + } + cpumask_setall(sdom->node_affinity_cpumask); + /* Initialize credit and weight */ INIT_LIST_HEAD(&sdom->active_vcpu); sdom->active_vcpu_count = 0; @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler * static void csched_free_domdata(const struct scheduler *ops, void *data) { + struct csched_dom *sdom = data; + + free_cpumask_var(sdom->node_affinity_cpumask); xfree(data); } @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu) } static struct csched_vcpu * -csched_runq_steal(int peer_cpu, int cpu, int pri) +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) { const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu); + struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu)); const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu); struct csched_vcpu *speer; struct list_head *iter; @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu, if ( speer->pri <= pri ) break; - /* Is this VCPU is runnable on our PCPU? */ + /* Is this VCPU runnable on our PCPU? */ vc = speer->vcpu; BUG_ON( is_idle_vcpu(vc) ); - if (__csched_vcpu_is_migrateable(vc, cpu)) + /* + * Retrieve the correct mask for this balance_step or, if we''re + * dealing with node-affinity and the vcpu has no node affinity + * at all, just skip this vcpu. That is needed if we want to + * check if we have any node-affine work to steal first (wrt + * any vcpu-affine work). + */ + if ( csched_balance_cpumask(vc, balance_step, + &scratch_balance_mask) ) + continue; + + if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) + && __csched_vcpu_should_migrate(cpu, &scratch_balance_mask, + prv->idlers) ) { /* We got a candidate. Grab it! */ TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat struct csched_vcpu *speer; cpumask_t workers; cpumask_t *online; - int peer_cpu; + int peer_cpu, peer_node, bstep; + int node = cpu_to_node(cpu); BUG_ON( cpu != snext->vcpu->processor ); online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu)); @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat SCHED_STAT_CRANK(load_balance_other); /* - * Peek at non-idling CPUs in the system, starting with our - * immediate neighbour. + * Let''s look around for work to steal, taking both vcpu-affinity + * and node-affinity into account. More specifically, we check all + * the non-idle CPUs'' runq, looking for: + * 1. any node-affine work to steal first, + * 2. if not finding anything, any vcpu-affine work to steal. */ - cpumask_andnot(&workers, online, prv->idlers); - cpumask_clear_cpu(cpu, &workers); - peer_cpu = cpu; + for_each_csched_balance_step( bstep ) + { + /* + * We peek at the non-idling CPUs in a node-wise fashion. 
In fact, + * it is more likely that we find some node-affine work on our same + * node, not to mention that migrating vcpus within the same node + * could well expected to be cheaper than across-nodes (memory + * stays local, there might be some node-wide cache[s], etc.). + */ + peer_node = node; + do + { + /* Find out what the !idle are in this node */ + cpumask_andnot(&workers, online, prv->idlers); + cpumask_and(&workers, &workers, &node_to_cpumask(peer_node)); + cpumask_clear_cpu(cpu, &workers); - while ( !cpumask_empty(&workers) ) - { - peer_cpu = cpumask_cycle(peer_cpu, &workers); - cpumask_clear_cpu(peer_cpu, &workers); + if ( cpumask_empty(&workers) ) + goto next_node; - /* - * Get ahold of the scheduler lock for this peer CPU. - * - * Note: We don''t spin on this lock but simply try it. Spinning could - * cause a deadlock if the peer CPU is also load balancing and trying - * to lock this CPU. - */ - if ( !pcpu_schedule_trylock(peer_cpu) ) - { - SCHED_STAT_CRANK(steal_trylock_failed); - continue; - } + peer_cpu = cpumask_first(&workers); + do + { + /* + * Get ahold of the scheduler lock for this peer CPU. + * + * Note: We don''t spin on this lock but simply try it. Spinning + * could cause a deadlock if the peer CPU is also load + * balancing and trying to lock this CPU. + */ + if ( !pcpu_schedule_trylock(peer_cpu) ) + { + SCHED_STAT_CRANK(steal_trylock_failed); + peer_cpu = cpumask_cycle(peer_cpu, &workers); + continue; + } - /* - * Any work over there to steal? - */ - speer = cpumask_test_cpu(peer_cpu, online) ? - csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL; - pcpu_schedule_unlock(peer_cpu); - if ( speer != NULL ) - { - *stolen = 1; - return speer; - } + /* Any work over there to steal? */ + speer = cpumask_test_cpu(peer_cpu, online) ? + csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL; + pcpu_schedule_unlock(peer_cpu); + + /* As soon as one vcpu is found, balancing ends */ + if ( speer != NULL ) + { + *stolen = 1; + return speer; + } + + peer_cpu = cpumask_cycle(peer_cpu, &workers); + + } while( peer_cpu != cpumask_first(&workers) ); + + next_node: + peer_node = cycle_node(peer_node, node_online_map); + } while( peer_node != node ); } out: diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h --- a/xen/include/xen/nodemask.h +++ b/xen/include/xen/nodemask.h @@ -41,6 +41,8 @@ * int last_node(mask) Number highest set bit, or MAX_NUMNODES * int first_unset_node(mask) First node not set in mask, or * MAX_NUMNODES. + * int cycle_node(node, mask) Next node cycling from ''node'', or + * MAX_NUMNODES * * nodemask_t nodemask_of_node(node) Return nodemask with bit ''node'' set * NODE_MASK_ALL Initializer - all bits set @@ -254,6 +256,16 @@ static inline int __first_unset_node(con find_first_zero_bit(maskp->bits, MAX_NUMNODES)); } +#define cycle_node(n, src) __cycle_node((n), &(src), MAX_NUMNODES) +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits) +{ + int nxt = __next_node(n, maskp, nbits); + + if (nxt == nbits) + nxt = __first_node(maskp, nbits); + return nxt; +} + #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES) #if MAX_NUMNODES <= BITS_PER_LONG
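The load-balancing rework above also relies on the new cycle_node() helper to visit the online nodes starting from the local one and wrapping around. A toy model of that visiting order, with nodes as bits of a uint64_t instead of a real nodemask_t and cycle_node_toy() as a made-up stand-in, might look like this:

#include <stdint.h>
#include <stdio.h>

/* Toy model of the cycle_node() helper added to nodemask.h: return the
 * next set bit after n, wrapping around to the first set bit. */
static int cycle_node_toy(int n, uint64_t mask, int nbits)
{
    for (int i = 1; i <= nbits; i++) {
        int nxt = (n + i) % nbits;
        if (mask & (1ULL << nxt))
            return nxt;
    }
    return nbits; /* empty mask */
}

int main(void)
{
    uint64_t online = 0x0b;  /* nodes 0, 1 and 3 online */
    int node = 1, peer = node;

    /* Same visiting order csched_load_balance() uses: start from the
     * local node, then cycle through the remaining online nodes. */
    do {
        printf("visiting node %d\n", peer);
        peer = cycle_node_toy(peer, online, 64);
    } while (peer != node);
    return 0;
}

Starting from node 1 with nodes 0, 1 and 3 online, the loop visits 1, 3 and then 0, mirroring the do/while in csched_load_balance() that prefers stealing node-affine work from the local node before looking further away.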
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
Make it possible to pass the node-affinity of a domain to the hypervisor from the upper layers, instead of always being computed automatically. Note that this also required generalizing the Flask hooks for setting and getting the affinity, so that they now deal with both vcpu and node affinity. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> --- Changes from v1: * added the missing dummy hook for nodeaffinity; * let the permission renaming affect flask policies too. diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors --- a/tools/flask/policy/policy/flask/access_vectors +++ b/tools/flask/policy/policy/flask/access_vectors @@ -47,8 +47,8 @@ class domain transition max_vcpus destroy - setvcpuaffinity - getvcpuaffinity + setaffinity + getaffinity scheduler getdomaininfo getvcpuinfo diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls --- a/tools/flask/policy/policy/mls +++ b/tools/flask/policy/policy/mls @@ -70,11 +70,11 @@ mlsconstrain domain transition (( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv))); # all the domain "read" ops -mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext } +mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext } ((l1 dom l2) or (t1 == mls_priv)); # all the domain "write" ops -mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext } +mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext } ((l1 eq l2) or (t1 == mls_priv)); # This is incomplete - similar constraints must be written for all classes diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if --- a/tools/flask/policy/policy/modules/xen/xen.if +++ b/tools/flask/policy/policy/modules/xen/xen.if @@ -55,9 +55,9 @@ define(`create_domain_build_label'', ` # manage_domain(priv, target) # Allow managing a running domain define(`manage_domain'', ` - allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity + allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity getaddrsize pause unpause trigger shutdown destroy - setvcpuaffinity setdomainmaxmem }; + setaffinity setdomainmaxmem }; '') # migrate_domain_out(priv, target) diff --git a/xen/common/domain.c b/xen/common/domain.c --- a/xen/common/domain.c +++ b/xen/common/domain.c @@ -222,6 +222,7 @@ struct domain *domain_create( spin_lock_init(&d->node_affinity_lock); d->node_affinity = NODE_MASK_ALL; + d->auto_node_affinity = 1; spin_lock_init(&d->shutdown_lock); d->shutdown_code = -1; @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct cpumask_or(cpumask, cpumask, online_affinity); } - for_each_online_node ( node ) - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) - node_set(node, nodemask); + if ( d->auto_node_affinity ) + { + /* Node-affinity is automaically computed from all vcpu-affinities */ + for_each_online_node ( node ) + if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) + node_set(node, nodemask); - d->node_affinity = nodemask; + d->node_affinity = nodemask; + } + else + { + /* Node-affinity is provided by someone else, just filter out 
cpus + * that are either offline or not in the affinity of any vcpus. */ + for_each_node_mask ( node, d->node_affinity ) + if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) ) + node_clear(node, d->node_affinity); + } + + sched_set_node_affinity(d, &d->node_affinity); + spin_unlock(&d->node_affinity_lock); free_cpumask_var(online_affinity); @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct } +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) +{ + /* Being affine with no nodes is just wrong */ + if ( nodes_empty(*affinity) ) + return -EINVAL; + + spin_lock(&d->node_affinity_lock); + + /* + * Being/becoming explicitly affine to all nodes is not particularly + * useful. Let''s take it as the `reset node affinity` command. + */ + if ( nodes_full(*affinity) ) + { + d->auto_node_affinity = 1; + goto out; + } + + d->auto_node_affinity = 0; + d->node_affinity = *affinity; + +out: + spin_unlock(&d->node_affinity_lock); + + domain_update_node_affinity(d); + + return 0; +} + + struct domain *get_domain_by_id(domid_t dom) { struct domain *d; diff --git a/xen/common/domctl.c b/xen/common/domctl.c --- a/xen/common/domctl.c +++ b/xen/common/domctl.c @@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe } break; + case XEN_DOMCTL_setnodeaffinity: + case XEN_DOMCTL_getnodeaffinity: + { + domid_t dom = op->domain; + struct domain *d = rcu_lock_domain_by_id(dom); + + ret = -ESRCH; + if ( d == NULL ) + break; + + ret = xsm_nodeaffinity(op->cmd, d); + if ( ret ) + goto nodeaffinity_out; + + if ( op->cmd == XEN_DOMCTL_setnodeaffinity ) + { + nodemask_t new_affinity; + + ret = xenctl_bitmap_to_nodemask(&new_affinity, + &op->u.nodeaffinity.nodemap); + if ( !ret ) + ret = domain_set_node_affinity(d, &new_affinity); + } + else + { + ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap, + &d->node_affinity); + } + + nodeaffinity_out: + rcu_unlock_domain(d); + } + break; + case XEN_DOMCTL_setvcpuaffinity: case XEN_DOMCTL_getvcpuaffinity: { diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c --- a/xen/common/keyhandler.c +++ b/xen/common/keyhandler.c @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int *set++ = ''\0''; } +static void nodeset_print(char *set, int size, const nodemask_t *mask) +{ + *set++ = ''[''; + set += nodelist_scnprintf(set, size-2, mask); + *set++ = '']''; + *set++ = ''\0''; +} + static void periodic_timer_print(char *str, int size, uint64_t period) { if ( period == 0 ) @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k dump_pageframe_info(d); + nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity); + printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr); + printk("VCPU information and callbacks for domain %u:\n", d->domain_id); for_each_vcpu ( d, v ) diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c --- a/xen/common/sched_credit.c +++ b/xen/common/sched_credit.c @@ -269,6 +269,33 @@ static inline void list_del_init(&svc->runq_elem); } +/* + * Translates node-affinity mask into a cpumask, so that we can use it during + * actual scheduling. That of course will contain all the cpus from all the + * set nodes in the original node-affinity mask. + * + * Note that any serialization needed to access mask safely is complete + * responsibility of the caller of this function/hook. 
+ */ +static void csched_set_node_affinity( + const struct scheduler *ops, + struct domain *d, + nodemask_t *mask) +{ + struct csched_dom *sdom; + int node; + + /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ + if ( unlikely(is_idle_domain(d)) ) + return; + + sdom = CSCHED_DOM(d); + cpumask_clear(sdom->node_affinity_cpumask); + for_each_node_mask( node, *mask ) + cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, + &node_to_cpumask(node)); +} + #define for_each_csched_balance_step(__step) \ for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) @@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); - if ( cpumask_full(sdom->node_affinity_cpumask) ) + if ( cpumask_full(sdom->node_affinity_cpumask) || + d->auto_node_affinity == 1 ) return -1; } else /* step == CSCHED_BALANCE_CPU_AFFINITY */ @@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def .adjust = csched_dom_cntl, .adjust_global = csched_sys_cntl, + .set_node_affinity = csched_set_node_affinity, + .pick_cpu = csched_cpu_pick, .do_schedule = csched_schedule, diff --git a/xen/common/schedule.c b/xen/common/schedule.c --- a/xen/common/schedule.c +++ b/xen/common/schedule.c @@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c return ret; } +void sched_set_node_affinity(struct domain *d, nodemask_t *mask) +{ + SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); +} + int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) { cpumask_t online_affinity; diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h --- a/xen/include/public/domctl.h +++ b/xen/include/public/domctl.h @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t); +/* Get/set the NUMA node(s) with which the guest has affinity with. */ +/* XEN_DOMCTL_setnodeaffinity */ +/* XEN_DOMCTL_getnodeaffinity */ +struct xen_domctl_nodeaffinity { + struct xenctl_bitmap nodemap;/* IN */ +}; +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); + + /* Get/set which physical cpus a vcpu can execute on. */ /* XEN_DOMCTL_setvcpuaffinity */ /* XEN_DOMCTL_getvcpuaffinity */ @@ -907,6 +917,8 @@ struct xen_domctl { #define XEN_DOMCTL_audit_p2m 65 #define XEN_DOMCTL_set_virq_handler 66 #define XEN_DOMCTL_set_broken_page_p2m 67 +#define XEN_DOMCTL_setnodeaffinity 68 +#define XEN_DOMCTL_getnodeaffinity 69 #define XEN_DOMCTL_gdbsx_guestmemio 1000 #define XEN_DOMCTL_gdbsx_pausevcpu 1001 #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 @@ -920,6 +932,7 @@ struct xen_domctl { struct xen_domctl_getpageframeinfo getpageframeinfo; struct xen_domctl_getpageframeinfo2 getpageframeinfo2; struct xen_domctl_getpageframeinfo3 getpageframeinfo3; + struct xen_domctl_nodeaffinity nodeaffinity; struct xen_domctl_vcpuaffinity vcpuaffinity; struct xen_domctl_shadow_op shadow_op; struct xen_domctl_max_mem max_mem; diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h --- a/xen/include/xen/nodemask.h +++ b/xen/include/xen/nodemask.h @@ -8,8 +8,9 @@ * See detailed comments in the file linux/bitmap.h describing the * data type on which these nodemasks are based. * - * For details of nodemask_scnprintf() and nodemask_parse(), - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. + * For details of nodemask_scnprintf(), nodelist_scnpintf() and + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse() + * in lib/bitmap.c. 
* * The available nodemask operations are: * @@ -50,6 +51,7 @@ * unsigned long *nodes_addr(mask) Array of unsigned long''s in mask * * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask * * for_each_node_mask(node, mask) for-loop node over mask @@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co #define nodes_addr(src) ((src).bits) +#define nodelist_scnprintf(buf, len, src) \ + __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES) +static inline int __nodelist_scnprintf(char *buf, int len, + const nodemask_t *srcp, int nbits) +{ + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); +} + #if 0 #define nodemask_scnprintf(buf, len, src) \ __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES) diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h --- a/xen/include/xen/sched-if.h +++ b/xen/include/xen/sched-if.h @@ -184,6 +184,8 @@ struct scheduler { struct xen_domctl_scheduler_op *); int (*adjust_global) (const struct scheduler *, struct xen_sysctl_scheduler_op *); + void (*set_node_affinity) (const struct scheduler *, + struct domain *, nodemask_t *); void (*dump_settings) (const struct scheduler *); void (*dump_cpu_state) (const struct scheduler *, int); diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h --- a/xen/include/xen/sched.h +++ b/xen/include/xen/sched.h @@ -359,8 +359,12 @@ struct domain /* Various mem_events */ struct mem_event_per_domain *mem_event; - /* Currently computed from union of all vcpu cpu-affinity masks. */ + /* + * Can be specified by the user. If that is not the case, it is + * computed from the union of all the vcpu cpu-affinity masks. 
+ */ nodemask_t node_affinity; + int auto_node_affinity; unsigned int last_alloc_node; spinlock_t node_affinity_lock; }; @@ -429,6 +433,7 @@ static inline void get_knownalive_domain ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED)); } +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity); void domain_update_node_affinity(struct domain *d); struct domain *domain_create( @@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain int sched_move_domain(struct domain *d, struct cpupool *c); long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); long sched_adjust_global(struct xen_sysctl_scheduler_op *); +void sched_set_node_affinity(struct domain *, nodemask_t *); int sched_id(void); void sched_tick_suspend(void); void sched_tick_resume(void); diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h --- a/xen/include/xsm/xsm.h +++ b/xen/include/xsm/xsm.h @@ -56,6 +56,7 @@ struct xsm_operations { int (*domain_create) (struct domain *d, u32 ssidref); int (*max_vcpus) (struct domain *d); int (*destroydomain) (struct domain *d); + int (*nodeaffinity) (int cmd, struct domain *d); int (*vcpuaffinity) (int cmd, struct domain *d); int (*scheduler) (struct domain *d); int (*getdomaininfo) (struct domain *d); @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str return xsm_call(destroydomain(d)); } +static inline int xsm_nodeaffinity (int cmd, struct domain *d) +{ + return xsm_call(nodeaffinity(cmd, d)); +} + static inline int xsm_vcpuaffinity (int cmd, struct domain *d) { return xsm_call(vcpuaffinity(cmd, d)); diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c --- a/xen/xsm/dummy.c +++ b/xen/xsm/dummy.c @@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d return 0; } +static int dummy_nodeaffinity (int cmd, struct domain *d) +{ + return 0; +} + static int dummy_vcpuaffinity (int cmd, struct domain *d) { return 0; @@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation set_to_dummy_if_null(ops, domain_create); set_to_dummy_if_null(ops, max_vcpus); set_to_dummy_if_null(ops, destroydomain); + set_to_dummy_if_null(ops, nodeaffinity); set_to_dummy_if_null(ops, vcpuaffinity); set_to_dummy_if_null(ops, scheduler); set_to_dummy_if_null(ops, getdomaininfo); diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c --- a/xen/xsm/flask/hooks.c +++ b/xen/xsm/flask/hooks.c @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do DOMAIN__DESTROY); } -static int flask_vcpuaffinity(int cmd, struct domain *d) +static int flask_affinity(int cmd, struct domain *d) { u32 perm; switch ( cmd ) { case XEN_DOMCTL_setvcpuaffinity: - perm = DOMAIN__SETVCPUAFFINITY; + case XEN_DOMCTL_setnodeaffinity: + perm = DOMAIN__SETAFFINITY; break; case XEN_DOMCTL_getvcpuaffinity: - perm = DOMAIN__GETVCPUAFFINITY; + case XEN_DOMCTL_getnodeaffinity: + perm = DOMAIN__GETAFFINITY; break; default: return -EPERM; @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops .domain_create = flask_domain_create, .max_vcpus = flask_max_vcpus, .destroydomain = flask_destroydomain, - .vcpuaffinity = flask_vcpuaffinity, + .nodeaffinity = flask_affinity, + .vcpuaffinity = flask_affinity, .scheduler = flask_scheduler, .getdomaininfo = flask_getdomaininfo, .getvcpucontext = flask_getvcpucontext, diff --git a/xen/xsm/flask/include/av_perm_to_string.h b/xen/xsm/flask/include/av_perm_to_string.h --- a/xen/xsm/flask/include/av_perm_to_string.h +++ b/xen/xsm/flask/include/av_perm_to_string.h @@ -37,8 +37,8 @@ S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") S_(SECCLASS_DOMAIN, 
DOMAIN__MAX_VCPUS, "max_vcpus") S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity") S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h --- a/xen/xsm/flask/include/av_permissions.h +++ b/xen/xsm/flask/include/av_permissions.h @@ -38,8 +38,8 @@ #define DOMAIN__TRANSITION 0x00000020UL #define DOMAIN__MAX_VCPUS 0x00000040UL #define DOMAIN__DESTROY 0x00000080UL -#define DOMAIN__SETVCPUAFFINITY 0x00000100UL -#define DOMAIN__GETVCPUAFFINITY 0x00000200UL +#define DOMAIN__SETAFFINITY 0x00000100UL +#define DOMAIN__GETAFFINITY 0x00000200UL #define DOMAIN__SCHEDULER 0x00000400UL #define DOMAIN__GETDOMAININFO 0x00000800UL #define DOMAIN__GETVCPUINFO 0x00001000UL
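
The common-code hunk that actually consumes the declarations above is not quoted here. Based only on what is shown (the new auto_node_affinity flag, the node_affinity_lock, and the sched_set_node_affinity() hook), a minimal sketch of how it plausibly fits together could look like the following; the real implementation may well differ, e.g. in how a full nodemask is treated or in when domain_update_node_affinity() gets involved:

    int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity)
    {
        /* An empty node-affinity makes no sense: refuse it. */
        if ( nodes_empty(*affinity) )
            return -EINVAL;

        spin_lock(&d->node_affinity_lock);

        /* Record the user-provided affinity and remember that it is no
         * longer the one automatically computed from the vcpu affinities. */
        d->node_affinity = *affinity;
        d->auto_node_affinity = 0;

        /* Give the scheduler (e.g. sched_credit) a chance to react. */
        sched_set_node_affinity(d, &d->node_affinity);

        spin_unlock(&d->node_affinity_lock);

        return 0;
    }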
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
By providing the proper get/set interface and wiring them to the new domctl-s from the previous commit. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c --- a/tools/libxc/xc_domain.c +++ b/tools/libxc/xc_domain.c @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch } +int xc_domain_node_setaffinity(xc_interface *xch, + uint32_t domid, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for setnodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_setnodeaffinity; + domctl.domain = (domid_t)domid; + + memcpy(local, nodemap, nodesize); + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + +int xc_domain_node_getaffinity(xc_interface *xch, + uint32_t domid, + xc_nodemap_t nodemap) +{ + DECLARE_DOMCTL; + DECLARE_HYPERCALL_BUFFER(uint8_t, local); + int ret = -1; + int nodesize; + + nodesize = xc_get_nodemap_size(xch); + if (!nodesize) + { + PERROR("Could not get number of nodes"); + goto out; + } + + local = xc_hypercall_buffer_alloc(xch, local, nodesize); + if ( local == NULL ) + { + PERROR("Could not allocate memory for getnodeaffinity domctl hypercall"); + goto out; + } + + domctl.cmd = XEN_DOMCTL_getnodeaffinity; + domctl.domain = (domid_t)domid; + + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; + + ret = do_domctl(xch, &domctl); + + memcpy(nodemap, local, nodesize); + + xc_hypercall_buffer_free(xch, local); + + out: + return ret; +} + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu, diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h --- a/tools/libxc/xenctrl.h +++ b/tools/libxc/xenctrl.h @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch, uint32_t id, uint32_t timeout); +/** + * This function explicitly sets the host NUMA nodes the domain will + * have affinity with. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one wants to set the affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_domain_node_setaffinity(xc_interface *xch, + uint32_t domind, + xc_nodemap_t nodemap); + +/** + * This function retrieves the host NUMA nodes the domain has + * affinity with. + * + * @parm xch a handle to an open hypervisor interface. + * @parm domid the domain id one wants to get the node affinity of. + * @parm nodemap the map of the affine nodes. + * @return 0 on success, -1 on failure. + */ +int xc_domain_node_getaffinity(xc_interface *xch, + uint32_t domind, + xc_nodemap_t nodemap); + int xc_vcpu_setaffinity(xc_interface *xch, uint32_t domid, int vcpu,
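
As a usage illustration (not part of the patch), a toolstack component could drive the new calls roughly as follows. Here xch and domid are assumed to be in scope, and xc_nodemap_alloc() is assumed to be the nodemap allocation helper introduced earlier in the series (patch 02); its exact name and semantics should be checked there.

    /* Hypothetical caller: restrict the domain's node-affinity to node 0. */
    xc_nodemap_t nodemap = xc_nodemap_alloc(xch);

    if ( nodemap != NULL )
    {
        nodemap[0] = 0x01;   /* one bit per node, bit 0 == node 0 */

        if ( xc_domain_node_setaffinity(xch, domid, nodemap) )
            fprintf(stderr, "setting the node-affinity failed\n");

        free(nodemap);       /* assuming a malloc()-based helper */
    }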
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
By introducing a nodemap in libxl_domain_build_info and providing the get/set methods to deal with it. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c --- a/tools/libxl/libxl.c +++ b/tools/libxl/libxl.c @@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx return rc; } +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap) +{ + if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity"); + return ERROR_FAIL; + } + return 0; +} + +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap) +{ + if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) { + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity"); + return ERROR_FAIL; + } + return 0; +} + int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) { GC_INIT(ctx); diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h --- a/tools/libxl/libxl.h +++ b/tools/libxl/libxl.h @@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct libxl_bitmap *cpumap); int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, unsigned int max_vcpus, libxl_bitmap *cpumap); +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap); +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, + libxl_bitmap *nodemap); int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap); libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx); diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c --- a/tools/libxl/libxl_create.c +++ b/tools/libxl/libxl_create.c @@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault( libxl_defbool_setdefault(&b_info->numa_placement, true); + if (!b_info->nodemap.size) { + if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0)) + return ERROR_FAIL; + libxl_bitmap_set_any(&b_info->nodemap); + } + if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) b_info->max_memkb = 32 * 1024; if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT) diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint if (rc) return rc; } + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl --- a/tools/libxl/libxl_types.idl +++ b/tools/libxl/libxl_types.idl @@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain ("max_vcpus", integer), ("avail_vcpus", libxl_bitmap), ("cpumap", libxl_bitmap), + ("nodemap", libxl_bitmap), ("numa_placement", libxl_defbool), ("tsc_mode", libxl_tsc_mode), ("max_memkb", MemKB),
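
Again, just as an illustration of how the new API is meant to be used (ctx and domid assumed to be in scope, error handling mostly omitted):

    libxl_bitmap nodemap;

    libxl_bitmap_init(&nodemap);
    if (libxl_node_bitmap_alloc(ctx, &nodemap, 0) == 0) {
        /* Ask for the domain to have affinity with node 1 only... */
        libxl_bitmap_set_none(&nodemap);
        libxl_bitmap_set(&nodemap, 1);
        if (libxl_domain_set_nodeaffinity(ctx, domid, &nodemap))
            fprintf(stderr, "failed setting the node affinity\n");

        /* ... and read it back, e.g., for displaying it. */
        libxl_domain_get_nodeaffinity(ctx, domid, &nodemap);
    }
    libxl_bitmap_dispose(&nodemap);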
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
For choosing the best NUMA placement candidate, we need to figure out how many VCPUs are runnable on each of them. That requires going through all the VCPUs of all the domains and check their affinities. With this change, instead of doing the above for each candidate, we do it once for all, populating an array while counting. This way, when we later are evaluating candidates, all we need is summing up the right elements of the array itself. This reduces the complexity of the overall algorithm, as it moves a potentially expensive operation (for_each_vcpu_of_each_domain {}) outside from the core placement loop, so that it is performed only once instead of (potentially) tens or hundreds of times. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -165,15 +165,27 @@ static uint32_t nodemap_to_free_memkb(li return free_memkb; } -/* Retrieve the number of vcpus able to run on the cpus of the nodes - * that are part of the nodemap. */ -static int nodemap_to_nr_vcpus(libxl__gc *gc, libxl_cputopology *tinfo, +/* Retrieve the number of vcpus able to run on the nodes in nodemap */ +static int nodemap_to_nr_vcpus(libxl__gc *gc, int vcpus_on_node[], const libxl_bitmap *nodemap) { + int i, nr_vcpus = 0; + + libxl_for_each_set_bit(i, *nodemap) + nr_vcpus += vcpus_on_node[i]; + + return nr_vcpus; +} + +/* Number of vcpus able to run on the cpus of the various nodes + * (reported by filling the array vcpus_on_node[]). */ +static int nr_vcpus_on_nodes(libxl__gc *gc, libxl_cputopology *tinfo, + const libxl_bitmap *suitable_cpumap, + int vcpus_on_node[]) +{ libxl_dominfo *dinfo = NULL; libxl_bitmap vcpu_nodemap; int nr_doms, nr_cpus; - int nr_vcpus = 0; int i, j, k; dinfo = libxl_list_domain(CTX, &nr_doms); @@ -193,19 +205,17 @@ static int nodemap_to_nr_vcpus(libxl__gc if (vinfo == NULL) continue; - /* For each vcpu of each domain ... 
*/ for (j = 0; j < nr_dom_vcpus; j++) { + /* For each vcpu of each domain, increment the elements of + * the array corresponding to the nodes where the vcpu runs */ + libxl_bitmap_set_none(&vcpu_nodemap); + libxl_for_each_set_bit(k, vinfo[j].cpumap) { + int node = tinfo[k].node; - /* Build up a map telling on which nodes the vcpu is runnable on */ - libxl_bitmap_set_none(&vcpu_nodemap); - libxl_for_each_set_bit(k, vinfo[j].cpumap) - libxl_bitmap_set(&vcpu_nodemap, tinfo[k].node); - - /* And check if that map has any intersection with our nodemap */ - libxl_for_each_set_bit(k, vcpu_nodemap) { - if (libxl_bitmap_test(nodemap, k)) { - nr_vcpus++; - break; + if (libxl_bitmap_test(suitable_cpumap, k) && + !libxl_bitmap_test(&vcpu_nodemap, node)) { + libxl_bitmap_set(&vcpu_nodemap, node); + vcpus_on_node[node]++; } } } @@ -215,7 +225,7 @@ static int nodemap_to_nr_vcpus(libxl__gc libxl_bitmap_dispose(&vcpu_nodemap); libxl_dominfo_list_free(dinfo, nr_doms); - return nr_vcpus; + return 0; } /* @@ -270,7 +280,7 @@ int libxl__get_numa_candidate(libxl__gc libxl_numainfo *ninfo = NULL; int nr_nodes = 0, nr_suit_nodes, nr_cpus = 0; libxl_bitmap suitable_nodemap, nodemap; - int rc = 0; + int *vcpus_on_node, rc = 0; libxl_bitmap_init(&nodemap); libxl_bitmap_init(&suitable_nodemap); @@ -281,6 +291,8 @@ int libxl__get_numa_candidate(libxl__gc if (ninfo == NULL) return ERROR_FAIL; + GCNEW_ARRAY(vcpus_on_node, nr_nodes); + /* * The good thing about this solution is that it is based on heuristics * (implemented in numa_cmpf() ), but we at least can evaluate it on @@ -330,6 +342,19 @@ int libxl__get_numa_candidate(libxl__gc goto out; /* + * Later on, we will try to figure out how many vcpus are runnable on + * each candidate (as a part of choosing the best one of them). That + * requires going through all the vcpus of all the domains and check + * their affinities. So, instead of doing that for each candidate, + * let''s count here the number of vcpus runnable on each node, so that + * all we have to do later is summing up the right elements of the + * vcpus_on_node array. + */ + rc = nr_vcpus_on_nodes(gc, tinfo, suitable_cpumap, vcpus_on_node); + if (rc) + goto out; + + /* * If the minimum number of NUMA nodes is not explicitly specified * (i.e., min_nodes == 0), we try to figure out a sensible number of nodes * from where to start generating candidates, if possible (or just start @@ -414,7 +439,8 @@ int libxl__get_numa_candidate(libxl__gc * current best one (if any). */ libxl__numa_candidate_put_nodemap(gc, &new_cndt, &nodemap); - new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, tinfo, &nodemap); + new_cndt.nr_vcpus = nodemap_to_nr_vcpus(gc, vcpus_on_node, + &nodemap); new_cndt.free_memkb = nodes_free_memkb; new_cndt.nr_nodes = libxl_bitmap_count_set(&nodemap); new_cndt.nr_cpus = nodes_cpus;
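
To put the commit message in more concrete terms: with D domains of (up to) V vCPUs each, N nodes and C placement candidates being evaluated, the old code paid roughly O(D*V) worth of bitmap work per candidate, i.e. O(C*D*V) overall, while with this change the full scan is done only once and each candidate costs just a sum over its (at most N) nodes, i.e. roughly O(D*V + C*N) overall.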
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
Which basically means the following two things: 1) during domain creation, it is the node-affinity of the domain --rather than the vcpu-affinities of its VCPUs-- that is affected by automatic placement; 2) during automatic placement, when counting how many VCPUs are already "bound" to a placement candidate (as part of the process of choosing the best candidate), both vcpu-affinity and node-affinity are considered. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: George Dunlap <george.dunlap@eu.citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c --- a/tools/libxl/libxl_dom.c +++ b/tools/libxl/libxl_dom.c @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc * { int found; libxl__numa_candidate candidate; - libxl_bitmap candidate_nodemap; + libxl_bitmap cpupool_nodemap; libxl_cpupoolinfo cpupool_info; int i, cpupool, rc = 0; uint32_t memkb; libxl__numa_candidate_init(&candidate); - libxl_bitmap_init(&candidate_nodemap); + libxl_bitmap_init(&cpupool_nodemap); /* * Extract the cpumap from the cpupool the domain belong to. In fact, @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc * rc = libxl_domain_need_memory(CTX, info, &memkb); if (rc) goto out; - if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { + if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) { rc = ERROR_FAIL; goto out; } @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc * if (found == 0) goto out; - /* Map the candidate''s node map to the domain''s info->cpumap */ - libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap); - rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); + /* Map the candidate''s node map to the domain''s info->nodemap */ + libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap); + + /* Avoid trying to set the affinity to nodes that might be in the + * candidate''s nodemap but out of our cpupool. */ + rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap, + &cpupool_nodemap); if (rc) goto out; - /* Avoid trying to set the affinity to cpus that might be in the - * nodemap but not in our cpupool. */ - libxl_for_each_set_bit(i, info->cpumap) { - if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) - libxl_bitmap_reset(&info->cpumap, i); + libxl_for_each_set_bit(i, info->nodemap) { + if (!libxl_bitmap_test(&cpupool_nodemap, i)) + libxl_bitmap_reset(&info->nodemap, i); } LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc * out: libxl__numa_candidate_dispose(&candidate); - libxl_bitmap_dispose(&candidate_nodemap); + libxl_bitmap_dispose(&cpupool_nodemap); libxl_cpupoolinfo_dispose(&cpupool_info); return rc; } @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint /* * Check if the domain has any CPU affinity. If not, try to build * up one. In case numa_place_domain() find at least a suitable - * candidate, it will affect info->cpumap accordingly; if it + * candidate, it will affect info->nodemap accordingly; if it * does not, it just leaves it as it is. This means (unless * some weird error manifests) the subsequent call to - * libxl_set_vcpuaffinity_all() will do the actual placement, + * libxl_domain_set_nodeaffinity() will do the actual placement, * whatever that turns out to be. 
*/ if (libxl_defbool_val(info->numa_placement)) { diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c --- a/tools/libxl/libxl_numa.c +++ b/tools/libxl/libxl_numa.c @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc * int vcpus_on_node[]) { libxl_dominfo *dinfo = NULL; - libxl_bitmap vcpu_nodemap; + libxl_bitmap dom_nodemap, vcpu_nodemap; int nr_doms, nr_cpus; int i, j, k; @@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc * return ERROR_FAIL; } + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { + libxl_bitmap_dispose(&vcpu_nodemap); + libxl_dominfo_list_free(dinfo, nr_doms); + return ERROR_FAIL; + } + for (i = 0; i < nr_doms; i++) { libxl_vcpuinfo *vinfo; int nr_dom_vcpus; @@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc * if (vinfo == NULL) continue; + /* Retrieve the domain''s node-affinity map */ + libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); + for (j = 0; j < nr_dom_vcpus; j++) { - /* For each vcpu of each domain, increment the elements of - * the array corresponding to the nodes where the vcpu runs */ + /* + * For each vcpu of each domain, it must have both vcpu-affinity + * and node-affinity to (a pcpu belonging to) a certain node to + * cause an increment in the corresponding element of the array. + */ libxl_bitmap_set_none(&vcpu_nodemap); libxl_for_each_set_bit(k, vinfo[j].cpumap) { int node = tinfo[k].node; if (libxl_bitmap_test(suitable_cpumap, k) && + libxl_bitmap_test(&dom_nodemap, node) && !libxl_bitmap_test(&vcpu_nodemap, node)) { libxl_bitmap_set(&vcpu_nodemap, node); vcpus_on_node[node]++; @@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc * libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); } + libxl_bitmap_dispose(&dom_nodemap); libxl_bitmap_dispose(&vcpu_nodemap); libxl_dominfo_list_free(dinfo, nr_doms); return 0;
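
For completeness, a hypothetical libxl caller that wants to skip the automatic placement altogether and provide its own node-affinity could now do something along these lines (error handling omitted; libxl__build_pre() will then push the map down via libxl_domain_set_nodeaffinity(), as per the previous patches):

    libxl_domain_build_info b_info;

    libxl_domain_build_info_init(&b_info);
    /* ... fill in the rest of b_info as usual ... */

    /* Explicit placement: turn the automatic one off... */
    libxl_defbool_set(&b_info.numa_placement, false);

    /* ... and state that we want the domain to stay on node 0. */
    libxl_node_bitmap_alloc(ctx, &b_info.nodemap, 0);
    libxl_bitmap_set_none(&b_info.nodemap);
    libxl_bitmap_set(&b_info.nodemap, 0);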
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
Node-affinity is now something that is under (some) control of the user, so show it upon request as part of the output of `xl list'' by the `-n'' option. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> --- Changes from v1: * print_{cpu,node}map() functions added instead of ''state variable''-izing print_bitmap(). diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c --- a/tools/libxl/xl_cmdimpl.c +++ b/tools/libxl/xl_cmdimpl.c @@ -2961,14 +2961,95 @@ out: } } -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain) +/* If map is not full, prints it and returns 0. Returns 1 otherwise. */ +static int print_bitmap(uint8_t *map, int maplen, FILE *stream) +{ + int i; + uint8_t pmap = 0, bitmask = 0; + int firstset = 0, state = 0; + + for (i = 0; i < maplen; i++) { + if (i % 8 == 0) { + pmap = *map++; + bitmask = 1; + } else bitmask <<= 1; + + switch (state) { + case 0: + case 2: + if ((pmap & bitmask) != 0) { + firstset = i; + state++; + } + continue; + case 1: + case 3: + if ((pmap & bitmask) == 0) { + fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); + if (i - 1 > firstset) + fprintf(stream, "-%d", i - 1); + state = 2; + } + continue; + } + } + switch (state) { + case 0: + fprintf(stream, "none"); + break; + case 2: + break; + case 1: + if (firstset == 0) + return 1; + case 3: + fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); + if (i - 1 > firstset) + fprintf(stream, "-%d", i - 1); + break; + } + + return 0; +} + +static void print_cpumap(uint8_t *map, int maplen, FILE *stream) +{ + if (print_bitmap(map, maplen, stream)) + fprintf(stream, "any cpu"); +} + +static void print_nodemap(uint8_t *map, int maplen, FILE *stream) +{ + if (print_bitmap(map, maplen, stream)) + fprintf(stream, "any node"); +} + +static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain) { int i; static const char shutdown_reason_letters[]= "-rscw"; + libxl_bitmap nodemap; + libxl_physinfo physinfo; + + libxl_bitmap_init(&nodemap); + libxl_physinfo_init(&physinfo); printf("Name ID Mem VCPUs\tState\tTime(s)"); if (verbose) printf(" UUID Reason-Code\tSecurity Label"); if (context && !verbose) printf(" Security Label"); + if (numa) { + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) { + fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n"); + exit(1); + } + if (libxl_get_physinfo(ctx, &physinfo) != 0) { + fprintf(stderr, "libxl_physinfo failed.\n"); + libxl_bitmap_dispose(&nodemap); + exit(1); + } + + printf(" NODE Affinity"); + } printf("\n"); for (i = 0; i < nb_domain; i++) { char *domname; @@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf, &size); if (rc < 0) - printf(" -"); + printf(" -"); else { - printf(" %s", buf); + printf(" %16s", buf); free(buf); } } + if (numa) { + libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap); + + putchar('' ''); + print_nodemap(nodemap.map, physinfo.nr_nodes, stdout); + } putchar(''\n''); } + + libxl_bitmap_dispose(&nodemap); + libxl_physinfo_dispose(&physinfo); } static void list_vm(void) @@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv) int opt, verbose = 0; int context = 0; int details = 0; + int numa = 0; int option_index = 0; static struct option long_options[] = { {"long", 0, 0, ''l''}, {"help", 0, 0, ''h''}, {"verbose", 0, 0, ''v''}, {"context", 0, 0, ''Z''}, + {"numa", 0, 0, ''n''}, {0, 0, 0, 0} }; @@ -3904,7 +3996,7 @@ 
int main_list(int argc, char **argv) int nb_domain, rc; while (1) { - opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index); + opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index); if (opt == -1) break; @@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv) case ''Z'': context = 1; break; + case ''n'': + numa = 1; + break; default: fprintf(stderr, "option `%c'' not supported.\n", optopt); break; @@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv) if (details) list_domains_details(info, nb_domain); else - list_domains(verbose, context, info, nb_domain); + list_domains(verbose, context, numa, info, nb_domain); if (info_free) libxl_dominfo_list_free(info, nb_domain); @@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a return 0; } -static void print_bitmap(uint8_t *map, int maplen, FILE *stream) -{ - int i; - uint8_t pmap = 0, bitmask = 0; - int firstset = 0, state = 0; - - for (i = 0; i < maplen; i++) { - if (i % 8 == 0) { - pmap = *map++; - bitmask = 1; - } else bitmask <<= 1; - - switch (state) { - case 0: - case 2: - if ((pmap & bitmask) != 0) { - firstset = i; - state++; - } - continue; - case 1: - case 3: - if ((pmap & bitmask) == 0) { - fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); - if (i - 1 > firstset) - fprintf(stream, "-%d", i - 1); - state = 2; - } - continue; - } - } - switch (state) { - case 0: - fprintf(stream, "none"); - break; - case 2: - break; - case 1: - if (firstset == 0) { - fprintf(stream, "any cpu"); - break; - } - case 3: - fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); - if (i - 1 > firstset) - fprintf(stream, "-%d", i - 1); - break; - } -} - static void print_vcpuinfo(uint32_t tdomid, const libxl_vcpuinfo *vcpuinfo, uint32_t nr_cpus) @@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom /* TIM */ printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); /* CPU AFFINITY */ - print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); + print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout); printf("\n"); } diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c --- a/tools/libxl/xl_cmdtable.c +++ b/tools/libxl/xl_cmdtable.c @@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = { "[options] [Domain]\n", "-l, --long Output all VM details\n" "-v, --verbose Prints out UUIDs and security context\n" - "-Z, --context Prints out security context" + "-Z, --context Prints out security context\n" + "-n, --numa Prints out NUMA node affinity" }, { "destroy", &main_destroy, 0, 1,
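
A small aside on print_bitmap() above: it does range compression, and the "map is full" case is reported through the return value, which is what lets the two new wrappers print "any cpu" / "any node". For instance (assuming the static function is lifted into a little test harness):

    uint8_t map1[] = { 0x2f };      /* bits 0,1,2,3 and 5 set        */
    uint8_t map2[] = { 0xff };      /* all of the first 8 bits set   */

    print_bitmap(map1, 8, stdout);  /* prints "0-3,5" and returns 0  */
    print_bitmap(map2, 8, stdout);  /* prints nothing and returns 1, */
                                    /* i.e. "any cpu" / "any node"   */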
Dario Faggioli
2012-Dec-19 19:07 UTC
[PATCH 10 of 10 v2] docs: rearrange and update NUMA placement documentation
To include the new concept of NUMA aware scheduling and its impact. Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown --- a/docs/misc/xl-numa-placement.markdown +++ b/docs/misc/xl-numa-placement.markdown @@ -14,22 +14,67 @@ the memory directly attached to the set The Xen hypervisor deals with NUMA machines by assigning to each domain a "node affinity", i.e., a set of NUMA nodes of the host from which they -get their memory allocated. +get their memory allocated. Also, even if the node affinity of a domain +is allowed to change on-line, it is very important to "place" the domain +correctly when it is fist created, as the most of its memory is allocated +at that time and can not (for now) be moved easily. NUMA awareness becomes very important as soon as many domains start running memory-intensive workloads on a shared host. In fact, the cost of accessing non node-local memory locations is very high, and the performance degradation is likely to be noticeable. -## Guest Placement in xl ## +For more information, have a look at the [Xen NUMA Introduction][numa_intro] +page on the Wiki. + +### Placing via pinning and cpupools ### + +The simplest way of placing a domain on a NUMA node is statically pinning +the domain''s vCPUs to the pCPUs of the node. This goes under the name of +CPU affinity and can be set through the "cpus=" option in the config file +(more about this below). Another option is to pool together the pCPUs +spanning the node and put the domain in such a cpupool with the "pool=" +config option (as documented in our [Wiki][cpupools_howto]). + +In both the above cases, the domain will not be able to execute outside +the specified set of pCPUs for any reasons, even if all those pCPUs are +busy doing something else while there are others, idle, pCPUs. + +So, when doing this, local memory accesses are 100% guaranteed, but that +may come at he cost of some load imbalances. + +### NUMA aware scheduling ### + +If the credit scheduler is in use, the concept of node affinity defined +above does not only apply to memory. In fact, starting from Xen 4.3, the +scheduler always tries to run the domain''s vCPUs on one of the nodes in +its node affinity. Only if that turns out to be impossible, it will just +pick any free pCPU. + +This is, therefore, something more flexible than CPU affinity, as a domain +can still run everywhere, it just prefers some nodes rather than others. +Locality of access is less guaranteed than in the pinning case, but that +comes along with better chances to exploit all the host resources (e.g., +the pCPUs). + +In fact, if all the pCPUs in a domain''s node affinity are busy, it is +possible for the domain to run outside of there, but it is very likely that +slower execution (due to remote memory accesses) is still better than no +execution at all, as it would happen with pinning. For this reason, NUMA +aware scheduling has the potential of bringing substantial performances +benefits, although this will depend on the workload. + +## Guest placement in xl ## If using xl for creating and managing guests, it is very easy to ask for both manual or automatic placement of them across the host''s NUMA nodes. -Note that xm/xend does the very same thing, the only differences residing -in the details of the heuristics adopted for the placement (see below). 
+Note that xm/xend does a very similar thing, the only differences being +the details of the heuristics adopted for automatic placement (see below), +and the lack of support (in both xm/xend and the Xen versions where that\ +was the default toolstack) for NUMA aware scheduling. -### Manual Guest Placement with xl ### +### Placing the guest manually ### Thanks to the "cpus=" option, it is possible to specify where a domain should be created and scheduled on, directly in its config file. This @@ -41,14 +86,19 @@ This is very simple and effective, but r administrator to explicitly specify affinities for each and every domain, or Xen won''t be able to guarantee the locality for their memory accesses. -It is also possible to deal with NUMA by partitioning the system using -cpupools. Again, this could be "The Right Answer" for many needs and -occasions, but has to be carefully considered and setup by hand. +Notice that this also pins the domain''s vCPUs to the specified set of +pCPUs, so it not only sets the domain''s node affinity (its memory will +come from the nodes to which the pCPUs belong), but at the same time +forces the vCPUs of the domain to be scheduled on those same pCPUs. -### Automatic Guest Placement with xl ### +### Placing the guest automatically ### If no "cpus=" option is specified in the config file, libxl tries to figure out on its own on which node(s) the domain could fit best. +If it finds one (some), the domain''s node affinity get set to there, +and both memory allocations and NUMA aware scheduling (for the credit +scheduler and starting from Xen 4.3) will comply with it. + It is worthwhile noting that optimally fitting a set of VMs on the NUMA nodes of an host is an incarnation of the Bin Packing Problem. In fact, the various VMs with different memory sizes are the items to be packed, @@ -81,7 +131,7 @@ largest amounts of free memory helps kee small, and maximizes the probability of being able to put more domains there. -## Guest Placement within libxl ## +## Guest placement in libxl ## xl achieves automatic NUMA placement because that is what libxl does by default. No API is provided (yet) for modifying the behaviour of @@ -93,15 +143,34 @@ any placement from happening: libxl_defbool_set(&domain_build_info->numa_placement, false); Also, if `numa_placement` is set to `true`, the domain must not -have any cpu affinity (i.e., `domain_build_info->cpumap` must +have any CPU affinity (i.e., `domain_build_info->cpumap` must have all its bits set, as it is by default), or domain creation will fail returning `ERROR_INVAL`. +Starting from Xen 4.3, in case automatic placement happens (and is +successful), it will affect the domain''s node affinity and _not_ its +CPU affinity. Namely, the domain''s vCPUs will not be pinned to any +pCPU on the host, but the memory from the domain will come from the +selected node(s) and the NUMA aware scheduling (if the credit scheduler +is in use) will try to keep the domain there as much as possible. + Besides than that, looking and/or tweaking the placement algorithm search "Automatic NUMA placement" in libxl\_internal.h. Note this may change in future versions of Xen/libxl. +## Xen < 4.3 ## + +As NUMA aware scheduling is a new feature of Xen 4.3, things are a little +bit different for earlier version of Xen. If no "cpus=" option is specified +and Xen 4.2 is in use, the automatic placement algorithm still runs, but +the results is used to _pin_ the vCPUs of the domain to the output node(s). 
+This is consistent with what was happening with xm/xend, which were also
+affecting the domain's CPU affinity.
+
+On a version of Xen earlier than 4.2, there is no automatic placement at
+all in xl or libxl, and hence no node or CPU affinity being affected.
+
 ## Limitations ##
 
 Analyzing various possible placement solutions is what makes the
@@ -109,3 +178,6 @@ algorithm flexible and quite effective.
 it won't scale well to systems with arbitrary number of nodes. For
 this reason, automatic placement is disabled (with a warning) if it
 is requested on a host with more than 16 NUMA nodes.
+
+[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
+[cpupools_howto]: http://wiki.xen.org/wiki/Cpupools_Howto
On Wed, 2012-12-19 at 20:07 +0100, Dario Faggioli wrote:
> Which, reasoning in terms of %-performances increase/decrease, means NUMA aware
> scheduling does as follows, as compared to no affinity at all and to pinning:
>
> ----------------------------------
> | SpecJBB2005 (throughput)       |
> ----------------------------------
> | #VMs | No affinity | Pinning   |
> |   2  |   +14.36%   |  -0.36%   |
> |   6  |   +14.72%   |  -0.26%   |
> |  10  |   +11.88%   |  -2.44%   |
> ----------------------------------
> | Sysbench memory (throughput)   |
> ----------------------------------
> | #VMs | No affinity | Pinning   |
> |   2  |   +14.12%   |  +0.09%   |
> |   6  |   +11.12%   |  +2.14%   |
> |  10  |   +11.81%   |  +5.06%   |
> ----------------------------------
> | LMBench proc (latency)         |
> ----------------------------------
> | #VMs | No affinity | Pinning   |
> ----------------------------------
> |   2  |   +10.02%   |  +1.07%   |
> |   6  |    +3.45%   |  +1.02%   |
> |  10  |    +2.94%   |  +4.53%   |
> ----------------------------------
>
Just to be sure, as I may not have picked the perfect wording: in the tables
above, a +xx.yy% means NUMA aware scheduling (i.e., with this patch series
fully applied) performs xx.yy% _better_ than either 'No affinity' or
'Pinning'. Conversely, a -zz.ww% means it performs zz.ww% worse.

Sorry, but the different combinations, and the presence of both throughput
values (which are better if high) and latency values (which are better if
low), made this a little bit tricky to present effectively. :-)

Regards,
Dario
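
Put differently, the entries above are presumably the relative gain of NUMA aware scheduling over each baseline, i.e. (NUMA - baseline) / baseline for the two throughput benchmarks and (baseline - NUMA) / baseline for the LMBench latency rows, so that a positive figure always means NUMA aware scheduling did better, and a negative one that it did worse.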
Juergen Gross
2012-Dec-20 06:44 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Am 19.12.2012 20:07, schrieb Dario Faggioli:> As vcpu-affinity tells where VCPUs must run, node-affinity tells > where they should or, better, prefer. While respecting vcpu-affinity > remains mandatory, node-affinity is not that strict, it only expresses > a preference, although honouring it is almost always true that will > bring significant performances benefit (especially as compared to > not having any affinity at all). > > This change modifies the VCPU load balancing algorithm (for the > credit scheduler only), introducing a two steps logic. > During the first step, we use the node-affinity mask. The aim is > giving precedence to the CPUs where it is known to be preferable > for the domain to run. If that fails in finding a valid PCPU, the > node-affinity is just ignored and, in the second step, we fall > back to using cpu-affinity only. > > Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com> > --- > Changes from v1: > * CPU masks variables moved off from the stack, as requested during > review. As per the comments in the code, having them in the private > (per-scheduler instance) struct could have been enough, but it would be > racy (again, see comments). For that reason, use a global bunch of > them of (via per_cpu());Wouldn''t it be better to put the mask in the scheduler private per-pcpu area? This could be applied to several other instances of cpu masks on the stack, too.> * George suggested a different load balancing logic during v1''s review. I > think he was right and then I changed the old implementation in a way > that resembles exactly that. I rewrote most of this patch to introduce > a more sensible and effective noda-affinity handling logic. > > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -111,6 +111,33 @@ > > > /* > + * Node Balancing > + */ > +#define CSCHED_BALANCE_CPU_AFFINITY 0 > +#define CSCHED_BALANCE_NODE_AFFINITY 1 > +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY > + > +/* > + * When building for high number of CPUs, cpumask_var_t > + * variables on stack are better avoided. However, we need them, > + * in order to be able to consider both vcpu and node affinity. > + * We also don''t want to xmalloc()/xfree() them, as that would > + * happen in critical code paths. Therefore, let''s (pre)allocate > + * some scratch space for them. > + * > + * Having one mask for each instance of the scheduler seems > + * enough, and that would suggest putting it wihin `struct > + * csched_private'' below. However, we don''t always hold the > + * private scheduler lock when the mask itself would need to > + * be used, leaving room for races. For that reason, we define > + * and use a cpumask_t for each CPU. As preemption is not an > + * issue here (we''re holding the runqueue spin-lock!), that is > + * both enough and safe. > + */ > +DEFINE_PER_CPU(cpumask_t, csched_balance_mask); > +#define scratch_balance_mask (this_cpu(csched_balance_mask)) > + > +/* > * Boot parameters > */ > static int __read_mostly sched_credit_tslice_ms = CSCHED_DEFAULT_TSLICE_MS; > @@ -159,6 +186,9 @@ struct csched_dom { > struct list_head active_vcpu; > struct list_head active_sdom_elem; > struct domain *dom; > + /* cpumask translated from the domain''s node-affinity. > + * Basically, the CPUs we prefer to be scheduled on. 
*/ > + cpumask_var_t node_affinity_cpumask; > uint16_t active_vcpu_count; > uint16_t weight; > uint16_t cap; > @@ -239,6 +269,42 @@ static inline void > list_del_init(&svc->runq_elem); > } > > +#define for_each_csched_balance_step(__step) \ > + for ( (__step) = CSCHED_BALANCE_LAST; (__step)>= 0; (__step)-- ) > + > +/* > + * Each csched-balance step has to use its own cpumask. This function > + * determines which one, given the step, and copies it in mask. Notice > + * that, in case of node-affinity balancing step, it also filters out from > + * the node-affinity mask the cpus that are not part of vc''s cpu-affinity, > + * as we do not want to end up running a vcpu where it would like, but > + * is not allowed to! > + * > + * As an optimization, if a domain does not have any node-affinity at all > + * (namely, its node affinity is automatically computed), not only the > + * computed mask will reflect its vcpu-affinity, but we also return -1 to > + * let the caller know that he can skip the step or quit the loop (if he > + * wants). > + */ > +static int > +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) > +{ > + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) > + { > + struct domain *d = vc->domain; > + struct csched_dom *sdom = CSCHED_DOM(d); > + > + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); > + > + if ( cpumask_full(sdom->node_affinity_cpumask) ) > + return -1; > + } > + else /* step == CSCHED_BALANCE_CPU_AFFINITY */ > + cpumask_copy(mask, vc->cpu_affinity); > + > + return 0; > +} > + > static void burn_credits(struct csched_vcpu *svc, s_time_t now) > { > s_time_t delta; > @@ -266,67 +332,94 @@ static inline void > struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu)); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > cpumask_t mask, idle_mask; > - int idlers_empty; > + int balance_step, idlers_empty; > > ASSERT(cur); > - cpumask_clear(&mask); > - > idlers_empty = cpumask_empty(prv->idlers); > > /* > - * If the pcpu is idle, or there are no idlers and the new > - * vcpu is a higher priority than the old vcpu, run it here. > - * > - * If there are idle cpus, first try to find one suitable to run > - * new, so we can avoid preempting cur. If we cannot find a > - * suitable idler on which to run new, run it here, but try to > - * find a suitable idler on which to run cur instead. > + * Node and vcpu-affinity balancing loop. To speed things up, in case > + * no node-affinity at all is present, scratch_balance_mask reflects > + * the vcpu-affinity, and ret is -1, so that we then can quit the > + * loop after only one step. > */ > - if ( cur->pri == CSCHED_PRI_IDLE > - || (idlers_empty&& new->pri> cur->pri) ) > + for_each_csched_balance_step( balance_step ) > { > - if ( cur->pri != CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_idlers_none); > - cpumask_set_cpu(cpu,&mask); > - } > - else if ( !idlers_empty ) > - { > - /* Check whether or not there are idlers that can run new */ > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > + int ret, new_idlers_empty; > + > + cpumask_clear(&mask); > > /* > - * If there are no suitable idlers for new, and it''s higher > - * priority than cur, ask the scheduler to migrate cur away. > - * We have to act like this (instead of just waking some of > - * the idlers suitable for cur) because cur is running. > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. 
> * > - * If there are suitable idlers for new, no matter priorities, > - * leave cur alone (as it is running and is, likely, cache-hot) > - * and wake some of them (which is waking up and so is, likely, > - * cache cold anyway). > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. > */ > - if ( cpumask_empty(&idle_mask)&& new->pri> cur->pri ) > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty&& new->pri> cur->pri) ) > { > - SCHED_STAT_CRANK(tickle_idlers_none); > - SCHED_VCPU_STAT_CRANK(cur, kicked_away); > - SCHED_VCPU_STAT_CRANK(cur, migrate_r); > - SCHED_STAT_CRANK(migrate_kicked_away); > - set_bit(_VPF_migrating,&cur->vcpu->pause_flags); > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu,&mask); > } > - else if ( !cpumask_empty(&idle_mask) ) > + else if ( !idlers_empty ) > { > - /* Which of the idlers suitable for new shall we wake up? */ > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > + /* Are there idlers suitable for new (for this balance step)? */ > + ret = csched_balance_cpumask(new->vcpu, balance_step, > +&scratch_balance_mask); > + cpumask_and(&idle_mask, prv->idlers,&scratch_balance_mask); > + new_idlers_empty = cpumask_empty(&idle_mask); > + > + /* > + * Let''s not be too harsh! If there aren''t idlers suitable > + * for new in its node-affinity mask, make sure we check its > + * vcpu-affinity as well, before tacking final decisions. > + */ > + if ( new_idlers_empty > +&& (balance_step == CSCHED_BALANCE_NODE_AFFINITY&& !ret) ) > + continue; > + > + /* > + * If there are no suitable idlers for new, and it''s higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). > + */ > + if ( new_idlers_empty&& new->pri> cur->pri ) > { > - this_cpu(last_tickle_cpu) > - cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask); > + SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating,&cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu,&mask); > } > - else > - cpumask_or(&mask,&mask,&idle_mask); > + else if ( !new_idlers_empty ) > + { > + /* Which of the idlers suitable for new shall we wake up? */ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > + { > + this_cpu(last_tickle_cpu) > + cpumask_cycle(this_cpu(last_tickle_cpu),&idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu),&mask); > + } > + else > + cpumask_or(&mask,&mask,&idle_mask); > + } > } > + > + /* Did we find anyone (or csched_balance_cpumask() says we''re done)? 
*/ > + if ( !cpumask_empty(&mask) || ret ) > + break; > } > > if ( !cpumask_empty(&mask) ) > @@ -475,15 +568,28 @@ static inline int > } > > static inline int > -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu) > +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) > { > /* > * Don''t pick up work that''s in the peer''s scheduling tail or hot on > - * peer PCPU. Only pick up work that''s allowed to run on our CPU. > + * peer PCPU. Only pick up work that prefers and/or is allowed to run > + * on our CPU. > */ > return !vc->is_running&& > !__csched_vcpu_is_cache_hot(vc)&& > - cpumask_test_cpu(dest_cpu, vc->cpu_affinity); > + cpumask_test_cpu(dest_cpu, mask); > +} > + > +static inline int > +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) > +{ > + /* > + * Consent to migration if cpu is one of the idlers in the VCPU''s > + * affinity mask. In fact, if that is not the case, it just means it > + * was some other CPU that was tickled and should hence come and pick > + * VCPU up. Migrating it to cpu would only make things worse. > + */ > + return cpumask_test_cpu(cpu, idlers)&& cpumask_test_cpu(cpu, mask); > } > > static int > @@ -493,85 +599,98 @@ static int > cpumask_t idlers; > cpumask_t *online; > struct csched_pcpu *spc = NULL; > + int ret, balance_step; > int cpu; > > - /* > - * Pick from online CPUs in VCPU''s affinity mask, giving a > - * preference to its current processor if it''s in there. > - */ > online = cpupool_scheduler_cpumask(vc->domain->cpupool); > - cpumask_and(&cpus, online, vc->cpu_affinity); > - cpu = cpumask_test_cpu(vc->processor,&cpus) > - ? vc->processor > - : cpumask_cycle(vc->processor,&cpus); > - ASSERT( !cpumask_empty(&cpus)&& cpumask_test_cpu(cpu,&cpus) ); > + for_each_csched_balance_step( balance_step ) > + { > + /* Pick an online CPU from the proper affinity mask */ > + ret = csched_balance_cpumask(vc, balance_step,&cpus); > + cpumask_and(&cpus,&cpus, online); > > - /* > - * Try to find an idle processor within the above constraints. > - * > - * In multi-core and multi-threaded CPUs, not all idle execution > - * vehicles are equal! > - * > - * We give preference to the idle execution vehicle with the most > - * idling neighbours in its grouping. This distributes work across > - * distinct cores first and guarantees we don''t do something stupid > - * like run two VCPUs on co-hyperthreads while there are idle cores > - * or sockets. > - * > - * Notice that, when computing the "idleness" of cpu, we may want to > - * discount vc. That is, iff vc is the currently running and the only > - * runnable vcpu on cpu, we add cpu to the idlers. > - */ > - cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers); > - if ( vc->processor == cpu&& IS_RUNQ_IDLE(cpu) ) > - cpumask_set_cpu(cpu,&idlers); > - cpumask_and(&cpus,&cpus,&idlers); > - cpumask_clear_cpu(cpu,&cpus); > + /* If present, prefer vc''s current processor */ > + cpu = cpumask_test_cpu(vc->processor,&cpus) > + ? vc->processor > + : cpumask_cycle(vc->processor,&cpus); > + ASSERT( !cpumask_empty(&cpus)&& cpumask_test_cpu(cpu,&cpus) ); > > - while ( !cpumask_empty(&cpus) ) > - { > - cpumask_t cpu_idlers; > - cpumask_t nxt_idlers; > - int nxt, weight_cpu, weight_nxt; > - int migrate_factor; > + /* > + * Try to find an idle processor within the above constraints. > + * > + * In multi-core and multi-threaded CPUs, not all idle execution > + * vehicles are equal! 
> + * > + * We give preference to the idle execution vehicle with the most > + * idling neighbours in its grouping. This distributes work across > + * distinct cores first and guarantees we don''t do something stupid > + * like run two VCPUs on co-hyperthreads while there are idle cores > + * or sockets. > + * > + * Notice that, when computing the "idleness" of cpu, we may want to > + * discount vc. That is, iff vc is the currently running and the only > + * runnable vcpu on cpu, we add cpu to the idlers. > + */ > + cpumask_and(&idlers,&cpu_online_map, CSCHED_PRIV(ops)->idlers); > + if ( vc->processor == cpu&& IS_RUNQ_IDLE(cpu) ) > + cpumask_set_cpu(cpu,&idlers); > + cpumask_and(&cpus,&cpus,&idlers); > + /* If there are idlers and cpu is still not among them, pick one */ > + if ( !cpumask_empty(&cpus)&& !cpumask_test_cpu(cpu,&cpus) ) > + cpu = cpumask_cycle(cpu,&cpus); > + cpumask_clear_cpu(cpu,&cpus); > > - nxt = cpumask_cycle(cpu,&cpus); > + while ( !cpumask_empty(&cpus) ) > + { > + cpumask_t cpu_idlers; > + cpumask_t nxt_idlers; > + int nxt, weight_cpu, weight_nxt; > + int migrate_factor; > > - if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > - { > - /* We''re on the same socket, so check the busy-ness of threads. > - * Migrate if # of idlers is less at all */ > - ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 1; > - cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, cpu)); > - cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, nxt)); > - } > - else > - { > - /* We''re on different sockets, so check the busy-ness of cores. > - * Migrate only if the other core is twice as idle */ > - ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 2; > - cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu)); > - cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt)); > + nxt = cpumask_cycle(cpu,&cpus); > + > + if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > + { > + /* We''re on the same socket, so check the busy-ness of threads. > + * Migrate if # of idlers is less at all */ > + ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 1; > + cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_sibling_mask, > + cpu)); > + cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_sibling_mask, > + nxt)); > + } > + else > + { > + /* We''re on different sockets, so check the busy-ness of cores. > + * Migrate only if the other core is twice as idle */ > + ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 2; > + cpumask_and(&cpu_idlers,&idlers, per_cpu(cpu_core_mask, cpu)); > + cpumask_and(&nxt_idlers,&idlers, per_cpu(cpu_core_mask, nxt)); > + } > + > + weight_cpu = cpumask_weight(&cpu_idlers); > + weight_nxt = cpumask_weight(&nxt_idlers); > + /* smt_power_savings: consolidate work rather than spreading it */ > + if ( sched_smt_power_savings ? > + weight_cpu> weight_nxt : > + weight_cpu * migrate_factor< weight_nxt ) > + { > + cpumask_and(&nxt_idlers,&cpus,&nxt_idlers); > + spc = CSCHED_PCPU(nxt); > + cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers); > + cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu)); > + } > + else > + { > + cpumask_andnot(&cpus,&cpus,&nxt_idlers); > + } > } > > - weight_cpu = cpumask_weight(&cpu_idlers); > - weight_nxt = cpumask_weight(&nxt_idlers); > - /* smt_power_savings: consolidate work rather than spreading it */ > - if ( sched_smt_power_savings ? 
> - weight_cpu> weight_nxt : > - weight_cpu * migrate_factor< weight_nxt ) > - { > - cpumask_and(&nxt_idlers,&cpus,&nxt_idlers); > - spc = CSCHED_PCPU(nxt); > - cpu = cpumask_cycle(spc->idle_bias,&nxt_idlers); > - cpumask_andnot(&cpus,&cpus, per_cpu(cpu_sibling_mask, cpu)); > - } > - else > - { > - cpumask_andnot(&cpus,&cpus,&nxt_idlers); > - } > + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */ > + if ( cpumask_test_cpu(cpu,&idlers) || ret ) > + break; > } > > if ( commit&& spc ) > @@ -913,6 +1032,13 @@ csched_alloc_domdata(const struct schedu > if ( sdom == NULL ) > return NULL; > > + if ( !alloc_cpumask_var(&sdom->node_affinity_cpumask) ) > + { > + xfree(sdom); > + return NULL; > + } > + cpumask_setall(sdom->node_affinity_cpumask); > + > /* Initialize credit and weight */ > INIT_LIST_HEAD(&sdom->active_vcpu); > sdom->active_vcpu_count = 0; > @@ -944,6 +1070,9 @@ csched_dom_init(const struct scheduler * > static void > csched_free_domdata(const struct scheduler *ops, void *data) > { > + struct csched_dom *sdom = data; > + > + free_cpumask_var(sdom->node_affinity_cpumask); > xfree(data); > } > > @@ -1240,9 +1369,10 @@ csched_tick(void *_cpu) > } > > static struct csched_vcpu * > -csched_runq_steal(int peer_cpu, int cpu, int pri) > +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) > { > const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu); > + struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu)); > const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu); > struct csched_vcpu *speer; > struct list_head *iter; > @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu, > if ( speer->pri<= pri ) > break; > > - /* Is this VCPU is runnable on our PCPU? */ > + /* Is this VCPU runnable on our PCPU? */ > vc = speer->vcpu; > BUG_ON( is_idle_vcpu(vc) ); > > - if (__csched_vcpu_is_migrateable(vc, cpu)) > + /* > + * Retrieve the correct mask for this balance_step or, if we''re > + * dealing with node-affinity and the vcpu has no node affinity > + * at all, just skip this vcpu. That is needed if we want to > + * check if we have any node-affine work to steal first (wrt > + * any vcpu-affine work). > + */ > + if ( csched_balance_cpumask(vc, balance_step, > +&scratch_balance_mask) ) > + continue; > + > + if ( __csched_vcpu_is_migrateable(vc, cpu,&scratch_balance_mask) > +&& __csched_vcpu_should_migrate(cpu,&scratch_balance_mask, > + prv->idlers) ) > { > /* We got a candidate. Grab it! */ > TRACE_3D(TRC_CSCHED_STOLEN_VCPU, peer_cpu, > @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat > struct csched_vcpu *speer; > cpumask_t workers; > cpumask_t *online; > - int peer_cpu; > + int peer_cpu, peer_node, bstep; > + int node = cpu_to_node(cpu); > > BUG_ON( cpu != snext->vcpu->processor ); > online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu)); > @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat > SCHED_STAT_CRANK(load_balance_other); > > /* > - * Peek at non-idling CPUs in the system, starting with our > - * immediate neighbour. > + * Let''s look around for work to steal, taking both vcpu-affinity > + * and node-affinity into account. More specifically, we check all > + * the non-idle CPUs'' runq, looking for: > + * 1. any node-affine work to steal first, > + * 2. if not finding anything, any vcpu-affine work to steal. 
> */ > - cpumask_andnot(&workers, online, prv->idlers); > - cpumask_clear_cpu(cpu,&workers); > - peer_cpu = cpu; > + for_each_csched_balance_step( bstep ) > + { > + /* > + * We peek at the non-idling CPUs in a node-wise fashion. In fact, > + * it is more likely that we find some node-affine work on our same > + * node, not to mention that migrating vcpus within the same node > + * could well expected to be cheaper than across-nodes (memory > + * stays local, there might be some node-wide cache[s], etc.). > + */ > + peer_node = node; > + do > + { > + /* Find out what the !idle are in this node */ > + cpumask_andnot(&workers, online, prv->idlers); > + cpumask_and(&workers,&workers,&node_to_cpumask(peer_node)); > + cpumask_clear_cpu(cpu,&workers); > > - while ( !cpumask_empty(&workers) ) > - { > - peer_cpu = cpumask_cycle(peer_cpu,&workers); > - cpumask_clear_cpu(peer_cpu,&workers); > + if ( cpumask_empty(&workers) ) > + goto next_node; > > - /* > - * Get ahold of the scheduler lock for this peer CPU. > - * > - * Note: We don''t spin on this lock but simply try it. Spinning could > - * cause a deadlock if the peer CPU is also load balancing and trying > - * to lock this CPU. > - */ > - if ( !pcpu_schedule_trylock(peer_cpu) ) > - { > - SCHED_STAT_CRANK(steal_trylock_failed); > - continue; > - } > + peer_cpu = cpumask_first(&workers); > + do > + { > + /* > + * Get ahold of the scheduler lock for this peer CPU. > + * > + * Note: We don''t spin on this lock but simply try it. Spinning > + * could cause a deadlock if the peer CPU is also load > + * balancing and trying to lock this CPU. > + */ > + if ( !pcpu_schedule_trylock(peer_cpu) ) > + { > + SCHED_STAT_CRANK(steal_trylock_failed); > + peer_cpu = cpumask_cycle(peer_cpu,&workers); > + continue; > + } > > - /* > - * Any work over there to steal? > - */ > - speer = cpumask_test_cpu(peer_cpu, online) ? > - csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL; > - pcpu_schedule_unlock(peer_cpu); > - if ( speer != NULL ) > - { > - *stolen = 1; > - return speer; > - } > + /* Any work over there to steal? */ > + speer = cpumask_test_cpu(peer_cpu, online) ? > + csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL; > + pcpu_schedule_unlock(peer_cpu); > + > + /* As soon as one vcpu is found, balancing ends */ > + if ( speer != NULL ) > + { > + *stolen = 1; > + return speer; > + } > + > + peer_cpu = cpumask_cycle(peer_cpu,&workers); > + > + } while( peer_cpu != cpumask_first(&workers) ); > + > + next_node: > + peer_node = cycle_node(peer_node, node_online_map); > + } while( peer_node != node ); > } > > out: > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h > --- a/xen/include/xen/nodemask.h > +++ b/xen/include/xen/nodemask.h > @@ -41,6 +41,8 @@ > * int last_node(mask) Number highest set bit, or MAX_NUMNODES > * int first_unset_node(mask) First node not set in mask, or > * MAX_NUMNODES. 
> + * int cycle_node(node, mask) Next node cycling from ''node'', or > + * MAX_NUMNODES > * > * nodemask_t nodemask_of_node(node) Return nodemask with bit ''node'' set > * NODE_MASK_ALL Initializer - all bits set > @@ -254,6 +256,16 @@ static inline int __first_unset_node(con > find_first_zero_bit(maskp->bits, MAX_NUMNODES)); > } > > +#define cycle_node(n, src) __cycle_node((n),&(src), MAX_NUMNODES) > +static inline int __cycle_node(int n, const nodemask_t *maskp, int nbits) > +{ > + int nxt = __next_node(n, maskp, nbits); > + > + if (nxt == nbits) > + nxt = __first_node(maskp, nbits); > + return nxt; > +} > + > #define NODE_MASK_LAST_WORD BITMAP_LAST_WORD_MASK(MAX_NUMNODES) > > #if MAX_NUMNODES<= BITS_PER_LONG > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel > >-- Juergen Gross Principal Developer Operating Systems PBG PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html
Dario Faggioli
2012-Dec-20 08:16 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote:> Am 19.12.2012 20:07, schrieb Dario Faggioli: > > [...] > > > > This change modifies the VCPU load balancing algorithm (for the > > credit scheduler only), introducing a two steps logic. > > During the first step, we use the node-affinity mask. The aim is > > giving precedence to the CPUs where it is known to be preferable > > for the domain to run. If that fails in finding a valid PCPU, the > > node-affinity is just ignored and, in the second step, we fall > > back to using cpu-affinity only. > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > --- > > Changes from v1: > > * CPU masks variables moved off from the stack, as requested during > > review. As per the comments in the code, having them in the private > > (per-scheduler instance) struct could have been enough, but it would be > > racy (again, see comments). For that reason, use a global bunch of > > them (via per_cpu()); > > Wouldn't it be better to put the mask in the scheduler private per-pcpu area? > This could be applied to several other instances of cpu masks on the stack, > too. >Yes, as I tried to explain, if it's per-cpu it should be fine, since credit has one runq per CPU and hence the runq lock is enough for serialization. BTW, can you be a little bit more specific about where you're suggesting to put it? I'm sorry, but I'm not sure I figured out what you mean by "the scheduler private per-pcpu area"... Do you perhaps mean making it a member of `struct csched_pcpu' ? Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
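For reference, a minimal sketch of the per_cpu() arrangement being described here; the quoted hunks only show scratch_balance_mask being used, so the definition below is an assumption rather than the patch's actual code.

    /* One scratch cpumask per pCPU; it is only ever touched while holding
     * that pCPU's runq lock, which is what makes the global variant safe. */
    static DEFINE_PER_CPU(cpumask_t, csched_balance_mask);
    #define scratch_balance_mask (this_cpu(csched_balance_mask))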
Juergen Gross
2012-Dec-20 08:25 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Am 20.12.2012 09:16, schrieb Dario Faggioli:> On Thu, 2012-12-20 at 07:44 +0100, Juergen Gross wrote: >> Am 19.12.2012 20:07, schrieb Dario Faggioli: >>> [...] >>> >>> This change modifies the VCPU load balancing algorithm (for the >>> credit scheduler only), introducing a two steps logic. >>> During the first step, we use the node-affinity mask. The aim is >>> giving precedence to the CPUs where it is known to be preferable >>> for the domain to run. If that fails in finding a valid PCPU, the >>> node-affinity is just ignored and, in the second step, we fall >>> back to using cpu-affinity only. >>> >>> Signed-off-by: Dario Faggioli<dario.faggioli@citrix.com> >>> --- >>> Changes from v1: >>> * CPU masks variables moved off from the stack, as requested during >>> review. As per the comments in the code, having them in the private >>> (per-scheduler instance) struct could have been enough, but it would be >>> racy (again, see comments). For that reason, use a global bunch of >>> them of (via per_cpu()); >> >> Wouldn''t it be better to put the mask in the scheduler private per-pcpu area? >> This could be applied to several other instances of cpu masks on the stack, >> too. >> > Yes, as I tired to explain, if it''s per-cpu it should be fine, since > credit has one runq per each CPU and hence runq lock is enough for > serialization. > > BTW, can you be a little bit more specific about where you''re suggesting > to put it? I''m sorry but I''m not sure I figured what you mean by "the > scheduler private per-pcpu area"... Do you perhaps mean making it a > member of `struct csched_pcpu'' ?Yes, that''s what I would suggest. Juergen -- Juergen Gross Principal Developer Operating Systems PBG PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html
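In code, the suggestion amounts to something like the following (sketch only: the existing members of struct csched_pcpu are elided and the new field name is illustrative, not taken from the series).

    struct csched_pcpu {
        /* ... existing fields (runq, idle_bias, ticker, ...) ... */
        cpumask_t balance_mask;    /* scratch space for the balancing steps */
    };

    /* callers, all running under this pCPU's runq lock, would then do: */
    cpumask_t *mask = &CSCHED_PCPU(cpu)->balance_mask;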
Dario Faggioli
2012-Dec-20 08:33 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote:> > BTW, can you be a little bit more specific about where you''re suggesting > > to put it? I''m sorry but I''m not sure I figured what you mean by "the > > scheduler private per-pcpu area"... Do you perhaps mean making it a > > member of `struct csched_pcpu'' ? > > Yes, that''s what I would suggest. >Ok then, functionally, that is going to be exactly the same thing as where it is right now, i.e., a set of global per_cpu() variables. It is possible for your solution to bring some cache/locality benefits, although that will very much depends on single cases, architecture, workload, etc. That being said, I''m definitely fine with it and can go for it. It was Jan that suggested/asked to pull them out of the stack, and I guess it''s George''s taste that we value most when hacking sched_*.c, so let''s see if they want to comment on this and then I''ll decide. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Juergen Gross
2012-Dec-20 08:39 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
Am 20.12.2012 09:33, schrieb Dario Faggioli:> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote: >>> BTW, can you be a little bit more specific about where you''re suggesting >>> to put it? I''m sorry but I''m not sure I figured what you mean by "the >>> scheduler private per-pcpu area"... Do you perhaps mean making it a >>> member of `struct csched_pcpu'' ? >> >> Yes, that''s what I would suggest. >> > Ok then, functionally, that is going to be exactly the same thing as > where it is right now, i.e., a set of global per_cpu() variables. It is > possible for your solution to bring some cache/locality benefits, > although that will very much depends on single cases, architecture, > workload, etc.The space is only allocated if sched_credit is responsible for the pcpu. Juergen -- Juergen Gross Principal Developer Operating Systems PBG PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html
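To make that point concrete: with the scratch mask folded into csched_pcpu, its storage lives in the per-pcpu data the credit scheduler allocates only for the pCPUs it actually manages, roughly as below (a sketch: the exact shape of csched_alloc_pdata() and the balance_mask field are assumptions; error handling and the rest of the initialisation are omitted).

    static void *
    csched_alloc_pdata(const struct scheduler *ops, int cpu)
    {
        struct csched_pcpu *spc = xzalloc(struct csched_pcpu);

        if ( spc == NULL )
            return NULL;

        /* The (hypothetical) scratch mask comes for free with the rest of
         * the per-pcpu data, and goes away with it when the pcpu is freed. */
        cpumask_clear(&spc->balance_mask);
        /* ... existing per-pcpu initialisation ... */
        return spc;
    }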
Ian Campbell
2012-Dec-20 08:41 UTC
Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:> This reduces the complexity of the overall algorithm, as it moves aWhat was/is the complexity before/after this change? ISTR it was O(n^2) or something along those lines before. Ian.
Dario Faggioli
2012-Dec-20 08:58 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, 2012-12-20 at 08:39 +0000, Juergen Gross wrote:> > Ok then, functionally, that is going to be exactly the same thing as > > where it is right now, i.e., a set of global per_cpu() variables. It is > > possible for your solution to bring some cache/locality benefits, > > although that will very much depends on single cases, architecture, > > workload, etc. > > The space is only allocated if sched_credit is responsible for the pcpu. >Right, that''s true too. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Jan Beulich
2012-Dec-20 09:17 UTC
Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote: > More specifically: > 1. replaces xenctl_cpumap with xenctl_bitmap > 2. provides bitmap_to_xenctl_bitmap and the reverse; > 3. re-implements cpumask_to_xenctl_bitmap with > bitmap_to_xenctl_bitmap and the reverse; > > Other than #3, no functional changes. Interface only slightly > affected. > > This is in preparation of introducing NUMA node-affinity maps. This (at least) lacks an adjustment to tools/tests/mce-test/tools/xen-mceinj.c afaict. Jan.
Jan Beulich
2012-Dec-20 09:18 UTC
Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote: > --- a/xen/include/xen/nodemask.h > +++ b/xen/include/xen/nodemask.h > @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const > } > #endif > > +/* > + * nodemask_var_t: struct nodemask for stack usage. > + * > + * See definition of cpumask_var_t in include/xen//cpumask.h. > + */ > +#if MAX_NUMNODES > 2 * BITS_PER_LONGIs that case reasonable to expect? Jan> +#include <xen/xmalloc.h> > + > +typedef nodemask_t *nodemask_var_t; > + > +#define nr_nodemask_bits (BITS_TO_LONGS(MAX_NUMNODES) * BITS_PER_LONG) > + > +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) > +{ > + *(void **)mask = _xmalloc(nr_nodemask_bits / 8, sizeof(long)); > + return *mask != NULL; > +} > + > +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) > +{ > + *(void **)mask = _xzalloc(nr_nodemask_bits / 8, sizeof(long)); > + return *mask != NULL; > +} > + > +static inline void free_nodemask_var(nodemask_var_t mask) > +{ > + xfree(mask); > +} > +#else > +typedef nodemask_t nodemask_var_t; > + > +static inline bool_t alloc_nodemask_var(nodemask_var_t *mask) > +{ > + return 1; > +} > + > +static inline bool_t zalloc_nodemask_var(nodemask_var_t *mask) > +{ > + nodes_clear(*mask); > + return 1; > +} > + > +static inline void free_nodemask_var(nodemask_var_t mask) > +{ > +} > +#endif > + > #if MAX_NUMNODES > 1 > #define for_each_node_mask(node, mask) \ > for ((node) = first_node(mask); \
Jan Beulich
2012-Dec-20 09:22 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
>>> On 20.12.12 at 09:33, Dario Faggioli <dario.faggioli@citrix.com> wrote: > It was Jan that suggested/asked to pull them out of the stack, and I > guess it's George's taste that we value most when hacking sched_*.c, so > let's see if they want to comment on this and then I'll decide. :-) Yes, I'm with Juergen here. Jan
Dario Faggioli
2012-Dec-20 09:24 UTC
Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
On Thu, 2012-12-20 at 08:41 +0000, Ian Campbell wrote:> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote: > > This reduces the complexity of the overall algorithm, as it moves a > > What was/is the complexity before/after this change? ISTR it was O(n^2) > or something along those lines before. >Yes and no. Let me try to explain. Counting the number of vCPUs that can run on a (set of) node(s) was and remains O(n_domains*n_domain_vcpus), so, yes, sort of quadratic. The upper bound of the number of candidates evaluated by the placement algorithm is exponential in the number of NUMA nodes: O(2^n_nodes). Before this change, we counted the number of vCPUs runnable on each candidate during each step, so the overall complexity was: O(2^n_nodes) * O(n_domains*n_domain_vcpus) In this change I count the number of vCPUs runnable on each candidate only once, and that happens outside the candidate generation loop, so the overall complexity is: O(n_domains*n_domain_vcpus) + O(2^n_nodes), with the exponential term typically dominating. Did I answer your question? Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) --------------------------------------------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
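To put purely illustrative numbers (not taken from the benchmarks in this series) on that difference, take 8 NUMA nodes and 16 domains of 4 vCPUs each:

    before:  2^8 candidates * (16*4) vCPU checks each = 16384 checks
    after:   (16*4) vCPU checks + 2^8 candidates      =   320 steps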
Dario Faggioli
2012-Dec-20 09:35 UTC
Re: [PATCH 01 of 10 v2] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
On Thu, 2012-12-20 at 09:17 +0000, Jan Beulich wrote:> >>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote: > > More specifically: > > 1. replaces xenctl_cpumap with xenctl_bitmap > > 2. provides bitmap_to_xenctl_bitmap and the reverse; > > 3. re-implement cpumask_to_xenctl_bitmap with > > bitmap_to_xenctl_bitmap and the reverse; > > > > Other than #3, no functional changes. Interface only slightly > > afected. > > > > This is in preparation of introducing NUMA node-affinity maps. > > This (at least) lacks an adjustment to > tools/tests/mce-test/tools/xen-mceinj.c afaict. >Indeed. I really think I was building those, but it looks like I wasn''t. Sorry for that and thanks for pointing out, will fix. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Dario Faggioli
2012-Dec-20 09:55 UTC
Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
On Thu, 2012-12-20 at 09:18 +0000, Jan Beulich wrote:> > +/* > > + * nodemask_var_t: struct nodemask for stack usage. > > + * > > + * See definition of cpumask_var_t in include/xen//cpumask.h. > > + */ > > +#if MAX_NUMNODES > 2 * BITS_PER_LONG > > Is that case reasonable to expect? >I really don''t think so. Here, I really was just aligning cpumask and nodemask types. But you''re right, thinking more about it, this is completely unnecessary. I''ll kill this hunk. Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
George Dunlap
2012-Dec-20 14:33 UTC
Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
On 20/12/12 09:18, Jan Beulich wrote:>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote: >> --- a/xen/include/xen/nodemask.h >> +++ b/xen/include/xen/nodemask.h >> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const >> } >> #endif >> >> +/* >> + * nodemask_var_t: struct nodemask for stack usage. >> + * >> + * See definition of cpumask_var_t in include/xen//cpumask.h. >> + */ >> +#if MAX_NUMNODES > 2 * BITS_PER_LONG > Is that case reasonable to expect?2 * BITS_PER_LONG is just going to be 128, right? It wasn''t too long ago that I would have considered 4096 cores a pretty unreasonable expectation. Is there a particular reason you think this is going to be more than a few years away, and a particular harm in having the code here to begin with? At very least it should be replaced with something like this: #if MAX_NUMNODES > 2 * BITS_PER_LONG # error "MAX_NUMNODES exceeds fixed size nodemask; need to implement variable-length nodemasks" #endif -George
Jan Beulich
2012-Dec-20 14:52 UTC
Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
>>> On 20.12.12 at 15:33, George Dunlap <george.dunlap@eu.citrix.com> wrote: > On 20/12/12 09:18, Jan Beulich wrote: >>>>> On 19.12.12 at 20:07, Dario Faggioli <dario.faggioli@citrix.com> wrote: >>> --- a/xen/include/xen/nodemask.h >>> +++ b/xen/include/xen/nodemask.h >>> @@ -298,6 +298,53 @@ static inline int __nodemask_parse(const >>> } >>> #endif >>> >>> +/* >>> + * nodemask_var_t: struct nodemask for stack usage. >>> + * >>> + * See definition of cpumask_var_t in include/xen/cpumask.h. >>> + */ >>> +#if MAX_NUMNODES > 2 * BITS_PER_LONG >> Is that case reasonable to expect? > > 2 * BITS_PER_LONG is just going to be 128, right? It wasn't too long > ago that I would have considered 4096 cores a pretty unreasonable > expectation. Is there a particular reason you think this is going to be > more than a few years away, and a particular harm in having the code > here to begin with? I just don't see node counts grow anywhere near as quickly as core/ thread ones. > At the very least it should be replaced with something like this: > > #if MAX_NUMNODES > 2 * BITS_PER_LONG > # error "MAX_NUMNODES exceeds fixed size nodemask; need to implement > variable-length nodemasks" > #endif Yes, if there is a limitation in the code, it being violated should be detected at build time. But I'd suppose one can construct the statically sized mask definition such that it copes with larger counts (just at the expense of having larger data objects perhaps on-stack). Making sure to always pass pointers rather than objects to functions will already eliminate a good part of the problem. Jan
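A trivial illustration of that last point (the helper names are made up): with a statically sized type, passing by pointer keeps the cost of a call constant no matter how large nodemask_t grows, while passing by value copies the whole bitmap each time.

    /* hypothetical prototypes, for illustration only */
    static int pick_node(const nodemask_t *nodes);  /* cheap: a pointer is passed   */
    static int pick_node_byval(nodemask_t nodes);   /* copies the whole mask object */

    /* caller */
    pick_node(&node_online_map);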
Dario Faggioli
2012-Dec-20 15:13 UTC
Re: [PATCH 02 of 10 v2] xen, libxc: introduce node maps and masks
On Thu, Dec 20, 2012 at 3:52 PM, Jan Beulich <JBeulich@suse.com> wrote:>> 2 * BITS_PER_LONG is just going to be 128, right? It wasn't too long >> ago that I would have considered 4096 cores a pretty unreasonable >> expectation. Is there a particular reason you think this is going to be >> more than a few years away, and a particular harm in having the code >> here to begin with? > > I just don't see node counts grow anywhere near as quickly as core/ > thread ones. >Yep, same here; that's why, despite having put that code there, I agree with Jan in removing it. :-) That feeling matches what we've been repeatedly told by hardware vendors, namely that node counts are already saturating at around 8. Beyond that, a different architectural approach will be needed to tackle scalability issues (clusters? NoCs?). Just for the record, Linux still deals with this in the same way it did when we took these files from there (i.e., without this hunk): - cpumasks exist in both static and dynamically allocated (cpumask_var_t) forms; - nodemasks are only and always statically allocated. Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) --------------------------------------------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2012-Dec-20 15:28 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 20/12/12 08:39, Juergen Gross wrote:> Am 20.12.2012 09:33, schrieb Dario Faggioli: >> On Thu, 2012-12-20 at 08:25 +0000, Juergen Gross wrote: >>>> BTW, can you be a little bit more specific about where you''re suggesting >>>> to put it? I''m sorry but I''m not sure I figured what you mean by "the >>>> scheduler private per-pcpu area"... Do you perhaps mean making it a >>>> member of `struct csched_pcpu'' ? >>> Yes, that''s what I would suggest. >>> >> Ok then, functionally, that is going to be exactly the same thing as >> where it is right now, i.e., a set of global per_cpu() variables. It is >> possible for your solution to bring some cache/locality benefits, >> although that will very much depends on single cases, architecture, >> workload, etc. > The space is only allocated if sched_credit is responsible for the pcpu.That makes sense. -George
George Dunlap
2012-Dec-20 15:56 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> As vcpu-affinity tells where VCPUs must run, node-affinity tells > where they should or, better, prefer. While respecting vcpu-affinity > remains mandatory, node-affinity is not that strict, it only expresses > a preference, although honouring it is almost always true that will > bring significant performances benefit (especially as compared to > not having any affinity at all). > > This change modifies the VCPU load balancing algorithm (for the > credit scheduler only), introducing a two steps logic. > During the first step, we use the node-affinity mask. The aim is > giving precedence to the CPUs where it is known to be preferable > for the domain to run. If that fails in finding a valid PCPU, the > node-affinity is just ignored and, in the second step, we fall > back to using cpu-affinity only. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>This one has a lot of structural comments; so I''m going to send a couple of different mails as I''m going through it, so we can parallize the discussion better. :-)> --- > Changes from v1: > * CPU masks variables moved off from the stack, as requested during > review. As per the comments in the code, having them in the private > (per-scheduler instance) struct could have been enough, but it would be > racy (again, see comments). For that reason, use a global bunch of > them of (via per_cpu()); > * George suggested a different load balancing logic during v1''s review. I > think he was right and then I changed the old implementation in a way > that resembles exactly that. I rewrote most of this patch to introduce > a more sensible and effective noda-affinity handling logic. > > diff --git a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -111,6 +111,33 @@ > > > /* > + * Node Balancing > + */ > +#define CSCHED_BALANCE_CPU_AFFINITY 0 > +#define CSCHED_BALANCE_NODE_AFFINITY 1 > +#define CSCHED_BALANCE_LAST CSCHED_BALANCE_NODE_AFFINITY[snip]> +#define for_each_csched_balance_step(__step) \ > + for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- )Why are we starting at the top and going down? Is there any good reason for it? Every time you do anything unexpected, you add to the cognitive load of the person reading your code, leaving less spare processing power or memory for other bits of the code, and increasing (sligthly) the chance of making a mistake. The most natural thing would be for someone to expect that the steps start at 0 and go up; just reversing it means it''s that little bit harder to understand. When you name it "LAST", it''s even worse, because that would definitely imply that this step is going to be executed last. So why not just have this be as follows? for(step=0; step<CSCHED_BALANCE_MAX; step++)> + > +/* > + * Each csched-balance step has to use its own cpumask. This function > + * determines which one, given the step, and copies it in mask. Notice > + * that, in case of node-affinity balancing step, it also filters out from > + * the node-affinity mask the cpus that are not part of vc''s cpu-affinity, > + * as we do not want to end up running a vcpu where it would like, but > + * is not allowed to! 
> + * > + * As an optimization, if a domain does not have any node-affinity at all > + * (namely, its node affinity is automatically computed), not only the > + * computed mask will reflect its vcpu-affinity, but we also return -1 to > + * let the caller know that he can skip the step or quit the loop (if he > + * wants). > + */ > +static int > +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) > +{ > + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) > + { > + struct domain *d = vc->domain; > + struct csched_dom *sdom = CSCHED_DOM(d); > + > + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); > + > + if ( cpumask_full(sdom->node_affinity_cpumask) ) > + return -1;There''s no optimization in having this comparison done here. You''re not reading something from a local variable that you''ve just calculated. But hiding this comparison inside this function, and disguising it as "returns -1", does increase the cognitive load on anybody trying to read and understand the code -- particularly, as how the return value is used is not really clear. Also, when you use this value, effectively what you''re doing is saying, "Actually, we just said we were doing the NODE_BALANCE step, but it turns out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so we''re just going to pretend that we''ve been doing the CPU_BALANCE step instead." (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret" -- why the !ret in this clause? Because if !ret then we''re not actually doing NODE_AFFINITY now, but CPU_AFFINITY.) Another non-negligible chunk of cognitive load for someone reading the code to 1) figure out, and 2) keep in mind as she tries to analyze it. I took a look at all the places which use this return value, and it seems like the best thing in each case would just be to have the *caller*, before getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and just skip the CSCHED_NODE_BALANCE step altogether if it''s true. (Example below.)> @@ -266,67 +332,94 @@ static inline void > struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu)); > struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); > cpumask_t mask, idle_mask; > - int idlers_empty; > + int balance_step, idlers_empty; > > ASSERT(cur); > - cpumask_clear(&mask); > - > idlers_empty = cpumask_empty(prv->idlers); > > /* > - * If the pcpu is idle, or there are no idlers and the new > - * vcpu is a higher priority than the old vcpu, run it here. > - * > - * If there are idle cpus, first try to find one suitable to run > - * new, so we can avoid preempting cur. If we cannot find a > - * suitable idler on which to run new, run it here, but try to > - * find a suitable idler on which to run cur instead. > + * Node and vcpu-affinity balancing loop. To speed things up, in case > + * no node-affinity at all is present, scratch_balance_mask reflects > + * the vcpu-affinity, and ret is -1, so that we then can quit the > + * loop after only one step. 
> */ > - if ( cur->pri == CSCHED_PRI_IDLE > - || (idlers_empty && new->pri > cur->pri) ) > + for_each_csched_balance_step( balance_step ) > { > - if ( cur->pri != CSCHED_PRI_IDLE ) > - SCHED_STAT_CRANK(tickle_idlers_none); > - cpumask_set_cpu(cpu, &mask); > - } > - else if ( !idlers_empty ) > - { > - /* Check whether or not there are idlers that can run new */ > - cpumask_and(&idle_mask, prv->idlers, new->vcpu->cpu_affinity); > + int ret, new_idlers_empty; > + > + cpumask_clear(&mask); > > /* > - * If there are no suitable idlers for new, and it''s higher > - * priority than cur, ask the scheduler to migrate cur away. > - * We have to act like this (instead of just waking some of > - * the idlers suitable for cur) because cur is running. > + * If the pcpu is idle, or there are no idlers and the new > + * vcpu is a higher priority than the old vcpu, run it here. > * > - * If there are suitable idlers for new, no matter priorities, > - * leave cur alone (as it is running and is, likely, cache-hot) > - * and wake some of them (which is waking up and so is, likely, > - * cache cold anyway). > + * If there are idle cpus, first try to find one suitable to run > + * new, so we can avoid preempting cur. If we cannot find a > + * suitable idler on which to run new, run it here, but try to > + * find a suitable idler on which to run cur instead. > */ > - if ( cpumask_empty(&idle_mask) && new->pri > cur->pri ) > + if ( cur->pri == CSCHED_PRI_IDLE > + || (idlers_empty && new->pri > cur->pri) ) > { > - SCHED_STAT_CRANK(tickle_idlers_none); > - SCHED_VCPU_STAT_CRANK(cur, kicked_away); > - SCHED_VCPU_STAT_CRANK(cur, migrate_r); > - SCHED_STAT_CRANK(migrate_kicked_away); > - set_bit(_VPF_migrating, &cur->vcpu->pause_flags); > + if ( cur->pri != CSCHED_PRI_IDLE ) > + SCHED_STAT_CRANK(tickle_idlers_none); > cpumask_set_cpu(cpu, &mask); > } > - else if ( !cpumask_empty(&idle_mask) ) > + else if ( !idlers_empty ) > { > - /* Which of the idlers suitable for new shall we wake up? */ > - SCHED_STAT_CRANK(tickle_idlers_some); > - if ( opt_tickle_one_idle ) > + /* Are there idlers suitable for new (for this balance step)? */ > + ret = csched_balance_cpumask(new->vcpu, balance_step, > + &scratch_balance_mask); > + cpumask_and(&idle_mask, prv->idlers, &scratch_balance_mask); > + new_idlers_empty = cpumask_empty(&idle_mask); > + > + /* > + * Let''s not be too harsh! If there aren''t idlers suitable > + * for new in its node-affinity mask, make sure we check its > + * vcpu-affinity as well, before tacking final decisions. > + */ > + if ( new_idlers_empty > + && (balance_step == CSCHED_BALANCE_NODE_AFFINITY && !ret) ) > + continue; > + > + /* > + * If there are no suitable idlers for new, and it''s higher > + * priority than cur, ask the scheduler to migrate cur away. > + * We have to act like this (instead of just waking some of > + * the idlers suitable for cur) because cur is running. > + * > + * If there are suitable idlers for new, no matter priorities, > + * leave cur alone (as it is running and is, likely, cache-hot) > + * and wake some of them (which is waking up and so is, likely, > + * cache cold anyway). 
> + */ > + if ( new_idlers_empty && new->pri > cur->pri ) > { > - this_cpu(last_tickle_cpu) > - cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > - cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > + SCHED_STAT_CRANK(tickle_idlers_none); > + SCHED_VCPU_STAT_CRANK(cur, kicked_away); > + SCHED_VCPU_STAT_CRANK(cur, migrate_r); > + SCHED_STAT_CRANK(migrate_kicked_away); > + set_bit(_VPF_migrating, &cur->vcpu->pause_flags); > + cpumask_set_cpu(cpu, &mask); > } > - else > - cpumask_or(&mask, &mask, &idle_mask); > + else if ( !new_idlers_empty ) > + { > + /* Which of the idlers suitable for new shall we wake up? */ > + SCHED_STAT_CRANK(tickle_idlers_some); > + if ( opt_tickle_one_idle ) > + { > + this_cpu(last_tickle_cpu) > + cpumask_cycle(this_cpu(last_tickle_cpu), &idle_mask); > + cpumask_set_cpu(this_cpu(last_tickle_cpu), &mask); > + } > + else > + cpumask_or(&mask, &mask, &idle_mask); > + } > } > + > + /* Did we find anyone (or csched_balance_cpumask() says we''re done)? */ > + if ( !cpumask_empty(&mask) || ret ) > + break; > }The whole logic here is really convoluted and hard to read. For example, if cur->pri==IDLE, then you will always just break of the loop after the first iteration. In that case, why have the if() inside the loop to begin with? And if idlers_empty is true but cur->pri >= new->pri, then you''ll go through the loop two times, even though both times it will come up empty. And, of course, the whole thing about the node affinity mask being checked inside csched_balance_cpumask(), but not used until the very end. A much more straighforward way to arrange it would be: if(cur->pri=IDLE &c &c) { foo; } else if(!idlers_empty) { if(cpumask_full(sdom->node_affinity_cpumask) balance_step=CSCHED_BALANCE_CPU_AFFINITY; else balance_step=CSCHED_BALANCE_NODE_AFFINITY; for(; balance_step <= CSCHED_BALANCE_MAX; balance_step++) { ... } } That seems a lot clearer to me -- does that make sense? [To be continued...]
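A cleaned-up rendering of the restructuring sketched just above (still only a sketch: it assumes the two balance-step constants get reordered so that CSCHED_BALANCE_NODE_AFFINITY comes before CSCHED_BALANCE_CPU_AFFINITY, and it borrows CSCHED_PRI_IDLE, SCHED_STAT_CRANK() and CSCHED_DOM() from the code under review).

    if ( cur->pri == CSCHED_PRI_IDLE || (idlers_empty && new->pri > cur->pri) )
    {
        if ( cur->pri != CSCHED_PRI_IDLE )
            SCHED_STAT_CRANK(tickle_idlers_none);
        cpumask_set_cpu(cpu, &mask);
    }
    else if ( !idlers_empty )
    {
        /* Skip the node step entirely when there is no real node-affinity. */
        balance_step = cpumask_full(CSCHED_DOM(new->vcpu->domain)->node_affinity_cpumask)
                       ? CSCHED_BALANCE_CPU_AFFINITY
                       : CSCHED_BALANCE_NODE_AFFINITY;

        for ( ; balance_step <= CSCHED_BALANCE_CPU_AFFINITY; balance_step++ )
        {
            /* ... look for idlers suitable for new at this step, and
             * break out as soon as someone to tickle has been found ... */
        }
    }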
Dario Faggioli
2012-Dec-20 16:00 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, Dec 20, 2012 at 4:28 PM, George Dunlap <george.dunlap@eu.citrix.com> wrote:>> The space is only allocated if sched_credit is responsible for the pcpu. > > > That makes sense. > -George >Ok, already done here in my queue. When I repost, you'll find it there. :-) Thanks and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) --------------------------------------------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2012-Dec-20 16:48 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> static inline int > -__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu) > +__csched_vcpu_is_migrateable(struct vcpu *vc, int dest_cpu, cpumask_t *mask) > { > /* > * Don''t pick up work that''s in the peer''s scheduling tail or hot on > - * peer PCPU. Only pick up work that''s allowed to run on our CPU. > + * peer PCPU. Only pick up work that prefers and/or is allowed to run > + * on our CPU. > */ > return !vc->is_running && > !__csched_vcpu_is_cache_hot(vc) && > - cpumask_test_cpu(dest_cpu, vc->cpu_affinity); > + cpumask_test_cpu(dest_cpu, mask); > +} > + > +static inline int > +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) > +{ > + /* > + * Consent to migration if cpu is one of the idlers in the VCPU''s > + * affinity mask. In fact, if that is not the case, it just means it > + * was some other CPU that was tickled and should hence come and pick > + * VCPU up. Migrating it to cpu would only make things worse. > + */ > + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask); > }I don''t get what this function is for. The only time you call it is in csched_runq_steal(), immediately after calling __csched_vcpu_is_migrateable(). But is_migrateable() has already checked cpu_mask_test_cpu(cpu, mask). So why do we need to check it again? We could just replace this with cpumask_test_cpu(cpu, prv->idlers). But that clause is going to be either true or false for every single iteration of all the loops, including the loops in csched_load_balance(). Wouldn''t it make more sense to check it once in csched_load_balance(), rather than doing all those nested loops? And in any case, looking at the caller of csched_load_balance(), it explicitly says to steal work if the next thing on the runqueue of cpu has a priority of TS_OVER. That was chosen for a reason -- if you want to change that, you should change it there at the top (and make a justification for doing so), not deeply nested in a function like this. Or am I completely missing something?> static struct csched_vcpu * > -csched_runq_steal(int peer_cpu, int cpu, int pri) > +csched_runq_steal(int peer_cpu, int cpu, int pri, int balance_step) > { > const struct csched_pcpu * const peer_pcpu = CSCHED_PCPU(peer_cpu); > + struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, peer_cpu)); > const struct vcpu * const peer_vcpu = curr_on_cpu(peer_cpu); > struct csched_vcpu *speer; > struct list_head *iter; > @@ -1265,11 +1395,24 @@ csched_runq_steal(int peer_cpu, int cpu, > if ( speer->pri <= pri ) > break; > > - /* Is this VCPU is runnable on our PCPU? */ > + /* Is this VCPU runnable on our PCPU? */ > vc = speer->vcpu; > BUG_ON( is_idle_vcpu(vc) ); > > - if (__csched_vcpu_is_migrateable(vc, cpu)) > + /* > + * Retrieve the correct mask for this balance_step or, if we''re > + * dealing with node-affinity and the vcpu has no node affinity > + * at all, just skip this vcpu. That is needed if we want to > + * check if we have any node-affine work to steal first (wrt > + * any vcpu-affine work). > + */ > + if ( csched_balance_cpumask(vc, balance_step, > + &scratch_balance_mask) ) > + continue;Again, I think for clarity the best thing to do here is: if ( balance_step == NODE && cpumask_full(speer->sdom->node_affinity_cpumask) ) continue; csched_balance_cpumask(); /* Etc. 
*/> @@ -1295,7 +1438,8 @@ csched_load_balance(struct csched_privat > struct csched_vcpu *speer; > cpumask_t workers; > cpumask_t *online; > - int peer_cpu; > + int peer_cpu, peer_node, bstep; > + int node = cpu_to_node(cpu); > > BUG_ON( cpu != snext->vcpu->processor ); > online = cpupool_scheduler_cpumask(per_cpu(cpupool, cpu)); > @@ -1312,42 +1456,68 @@ csched_load_balance(struct csched_privat > SCHED_STAT_CRANK(load_balance_other); > > /* > - * Peek at non-idling CPUs in the system, starting with our > - * immediate neighbour. > + * Let''s look around for work to steal, taking both vcpu-affinity > + * and node-affinity into account. More specifically, we check all > + * the non-idle CPUs'' runq, looking for: > + * 1. any node-affine work to steal first, > + * 2. if not finding anything, any vcpu-affine work to steal. > */ > - cpumask_andnot(&workers, online, prv->idlers); > - cpumask_clear_cpu(cpu, &workers); > - peer_cpu = cpu; > + for_each_csched_balance_step( bstep ) > + { > + /* > + * We peek at the non-idling CPUs in a node-wise fashion. In fact, > + * it is more likely that we find some node-affine work on our same > + * node, not to mention that migrating vcpus within the same node > + * could well expected to be cheaper than across-nodes (memory > + * stays local, there might be some node-wide cache[s], etc.). > + */ > + peer_node = node; > + do > + { > + /* Find out what the !idle are in this node */ > + cpumask_andnot(&workers, online, prv->idlers); > + cpumask_and(&workers, &workers, &node_to_cpumask(peer_node)); > + cpumask_clear_cpu(cpu, &workers); > > - while ( !cpumask_empty(&workers) ) > - { > - peer_cpu = cpumask_cycle(peer_cpu, &workers); > - cpumask_clear_cpu(peer_cpu, &workers); > + if ( cpumask_empty(&workers) ) > + goto next_node; > > - /* > - * Get ahold of the scheduler lock for this peer CPU. > - * > - * Note: We don''t spin on this lock but simply try it. Spinning could > - * cause a deadlock if the peer CPU is also load balancing and trying > - * to lock this CPU. > - */ > - if ( !pcpu_schedule_trylock(peer_cpu) ) > - { > - SCHED_STAT_CRANK(steal_trylock_failed); > - continue; > - } > + peer_cpu = cpumask_first(&workers); > + do > + { > + /* > + * Get ahold of the scheduler lock for this peer CPU. > + * > + * Note: We don''t spin on this lock but simply try it. Spinning > + * could cause a deadlock if the peer CPU is also load > + * balancing and trying to lock this CPU. > + */ > + if ( !pcpu_schedule_trylock(peer_cpu) ) > + { > + SCHED_STAT_CRANK(steal_trylock_failed); > + peer_cpu = cpumask_cycle(peer_cpu, &workers); > + continue; > + } > > - /* > - * Any work over there to steal? > - */ > - speer = cpumask_test_cpu(peer_cpu, online) ? > - csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL; > - pcpu_schedule_unlock(peer_cpu); > - if ( speer != NULL ) > - { > - *stolen = 1; > - return speer; > - } > + /* Any work over there to steal? */ > + speer = cpumask_test_cpu(peer_cpu, online) ? > + csched_runq_steal(peer_cpu, cpu, snext->pri, bstep) : NULL; > + pcpu_schedule_unlock(peer_cpu); > + > + /* As soon as one vcpu is found, balancing ends */ > + if ( speer != NULL ) > + { > + *stolen = 1; > + return speer; > + } > + > + peer_cpu = cpumask_cycle(peer_cpu, &workers); > + > + } while( peer_cpu != cpumask_first(&workers) ); > + > + next_node: > + peer_node = cycle_node(peer_node, node_online_map); > + } while( peer_node != node ); > }These changes all look right. But then, I''m a bit tired, so I''ll give it another once-over tomorrow. :-) [To be continued]
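Spelled out in code, the check suggested for csched_runq_steal() would look roughly like this (a sketch reusing speer, vc, balance_step and scratch_balance_mask from the hunk quoted above; with the cpumask_full() test done here, csched_balance_cpumask() no longer needs its special return value).

    /* The node step is pointless for a vcpu with no real node-affinity. */
    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && cpumask_full(speer->sdom->node_affinity_cpumask) )
        continue;

    csched_balance_cpumask(vc, balance_step, &scratch_balance_mask);

    if ( __csched_vcpu_is_migrateable(vc, cpu, &scratch_balance_mask) )
    {
        /* ... we got a candidate: trace it and grab it, as in the patch ... */
    }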
Dario Faggioli
2012-Dec-20 17:12 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, Dec 20, 2012 at 4:56 PM, George Dunlap <george.dunlap@eu.citrix.com> wrote:>> This change modifies the VCPU load balancing algorithm (for the >> credit scheduler only), introducing a two steps logic. >> During the first step, we use the node-affinity mask. The aim is >> giving precedence to the CPUs where it is known to be preferable >> for the domain to run. If that fails in finding a valid PCPU, the >> node-affinity is just ignored and, in the second step, we fall >> back to using cpu-affinity only. >> >> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > > This one has a lot of structural comments; so I''m going to send a couple of > different mails as I''m going through it, so we can parallize the discussion > better. :-) >Ok.>> +#define for_each_csched_balance_step(__step) \ >> + for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) > > > Why are we starting at the top and going down? Is there any good reason for > it? >You''re totally right, it looked like this in the very first RFC, when the whole set of macros was different. I changed that but never considered this, but I agree going up would be more natural.> So why not just have this be as follows? > > for(step=0; step<CSCHED_BALANCE_MAX; step++) >Will do.>> +static int >> +csched_balance_cpumask(const struct vcpu *vc, int step, cpumask_t *mask) >> +{ >> + if ( step == CSCHED_BALANCE_NODE_AFFINITY ) >> + { >> + struct domain *d = vc->domain; >> + struct csched_dom *sdom = CSCHED_DOM(d); >> + >> + cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); >> + >> + if ( cpumask_full(sdom->node_affinity_cpumask) ) >> + return -1; > > > There''s no optimization in having this comparison done here. You''re not > reading something from a local variable that you''ve just calculated. But > hiding this comparison inside this function, and disguising it as "returns > -1", does increase the cognitive load on anybody trying to read and > understand the code -- particularly, as how the return value is used is not > really clear. >Yes, I again agree this is ugly. The previous version was better, from the caller point of view, but had other downsides (IIRC it was mangling with the loop control variable directly) so we agreed on this ''-1'' thing.> Also, when you use this value, effectively what you''re doing is saying, > "Actually, we just said we were doing the NODE_BALANCE step, but it turns > out that the results of NODE_BALANCE and CPU_BALANCE will be the same, so > we''re just going to pretend that we''ve been doing the CPU_BALANCE step > instead." (See for example, "balance_step == CSCHED_BALANCE_NODE_AFFINITY > && !ret" -- why the !ret in this clause? Because if !ret then we''re not > actually doing NODE_AFFINITY now, but CPU_AFFINITY.) Another non-negligible > chunk of cognitive load for someone reading the code to 1) figure out, and > 2) keep in mind as she tries to analyze it. >Totally agreed. :-)> I took a look at all the places which use this return value, and it seems > like the best thing in each case would just be to have the *caller*, before > getting into the loop, call cpumask_full(sdom->node_affinity_cpumask) and > just skip the CSCHED_NODE_BALANCE step altogether if it''s true. (Example > below.) >I will think about it and see if I can find a nice solution for making that happen (the point being I''m not sure I like exposing and disseminating that cpumask_full... 
thing that much, but I guess I can hide it under some more macro-ing and stuff).>> @@ -266,67 +332,94 @@ static inline void >> struct csched_vcpu * const cur = CSCHED_VCPU(curr_on_cpu(cpu)); >> struct csched_private *prv = CSCHED_PRIV(per_cpu(scheduler, cpu)); >> cpumask_t mask, idle_mask; >> - int idlers_empty; >> + int balance_step, idlers_empty; >> >> ASSERT(cur); >> - cpumask_clear(&mask); >> - >> idlers_empty = cpumask_empty(prv->idlers); >> >> /* >> - * If the pcpu is idle, or there are no idlers and the new >> - * vcpu is a higher priority than the old vcpu, run it here. >> - * >> - * If there are idle cpus, first try to find one suitable to run >> - * new, so we can avoid preempting cur. If we cannot find a >> - * suitable idler on which to run new, run it here, but try to >> - * find a suitable idler on which to run cur instead. >> + * Node and vcpu-affinity balancing loop. To speed things up, in case >> + * no node-affinity at all is present, scratch_balance_mask reflects >> + * the vcpu-affinity, and ret is -1, so that we then can quit the >> + * loop after only one step. >> */ >> [snip] > > The whole logic here is really convoluted and hard to read. For example, if > cur->pri==IDLE, then you will always just break of the loop after the first > iteration. In that case, why have the if() inside the loop to begin with? > And if idlers_empty is true but cur->pri >= new->pri, then you''ll go through > the loop two times, even though both times it will come up empty. And, of > course, the whole thing about the node affinity mask being checked inside > csched_balance_cpumask(), but not used until the very end. >I fear it looks convoluted and complex because, well, it _is_ quite complex! However, I see you''re point, and there definitely are chances for it to, although being complex, look more complex than it should. :-)> A much more straighforward way to arrange it would be: > > if(cur->pri=IDLE &c &c) > { > foo; > } > else if(!idlers_empty) > { > if(cpumask_full(sdom->node_affinity_cpumask) > balance_step=CSCHED_BALANCE_CPU_AFFINITY; > else > balance_step=CSCHED_BALANCE_NODE_AFFINITY; >Yes, I think this can be taken out. I''ll give some more thinking to this and see if I can simplify the flow.> [To be continued...] >Thanks for now :-) Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) --------------------------------------------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
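One possible shape for that "macro-ing" (the helper name is hypothetical, not something from the series): a small predicate callers can use to decide whether the node-affinity step is worth doing at all, instead of open-coding the cpumask_full() check everywhere.

    /* Hypothetical helper: true iff the domain has an actual node-affinity,
     * i.e. one that is not just "all nodes" computed automatically. */
    #define __vcpu_has_node_affinity(vc) \
        (!cpumask_full(CSCHED_DOM((vc)->domain)->node_affinity_cpumask))

    /* e.g., in csched_runq_steal() or __runq_tickle(): */
    if ( balance_step == CSCHED_BALANCE_NODE_AFFINITY
         && !__vcpu_has_node_affinity(vc) )
        continue;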
Dario Faggioli
2012-Dec-20 18:18 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap <george.dunlap@eu.citrix.com> wrote:> On 19/12/12 19:07, Dario Faggioli wrote: >> +static inline int >> +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) >> +{ >> + /* >> + * Consent to migration if cpu is one of the idlers in the VCPU''s >> + * affinity mask. In fact, if that is not the case, it just means it >> + * was some other CPU that was tickled and should hence come and pick >> + * VCPU up. Migrating it to cpu would only make things worse. >> + */ >> + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask); >> } > > And in any case, looking at the caller of csched_load_balance(), it > explicitly says to steal work if the next thing on the runqueue of cpu has a > priority of TS_OVER. That was chosen for a reason -- if you want to change > that, you should change it there at the top (and make a justification for > doing so), not deeply nested in a function like this. > > Or am I completely missing something? >No, you''re right. Trying to solve a nasty issue I was seeing, I overlooked I was changing the underlying logic until that point... Thanks! What I want to avoid is the following: a vcpu wakes-up on the busy pcpu Y. As a consequence, the idle pcpu X is tickled. Then, for any unrelated reason, pcpu Z reschedules and, as it would go idle too, it looks around for any vcpu to steal, finds one in Y''s runqueue and grabs it. Afterward, when X gets the IPI and schedules, it just does not find anyone to run and goes back idling. Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while it has a full vcpu-affinity, i.e., can run everywhere). In this case, a vcpu that could have run on a pcpu in its node-affinity, executes outside from it. That happens because, the NODE_BALANCE_STEP in csched_load_balance(), when called by Z, won''t find anything suitable to steal (provided there actually isn''t any vcpu waiting in any runqueue with node-affinity with Z), while the CPU_BALANCE_STEP will find our vcpu. :-( So, what I wanted is something that could tell me whether the pcpu which is stealing work is the one that has actually been tickled to do so. I was then using the pcpu idleness as a (cheap and easy to check) indication of that, but I now see this is having side effects I in the first place did not want to cause. Sorry for that, I probably spent so much time buried, as you where saying, in the various nested loops and calls, that I lost the context a little bit! :-P Ok, I think the problem I was describing is real, and I''ve seen it happening and causing performances degradation. However, as I think a good solution is going to be more complex than I thought, I''d better repost without this function and deal with it in a future separate patch (after having figured out the best way of doing so). Is that fine with you?> These changes all look right. >At least. :-)> But then, I''m a bit tired, so I''ll give it > another once-over tomorrow. :-) >I can imagine, looking forward to your next comments. Thanks a lot and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) --------------------------------------------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
George Dunlap
2012-Dec-20 20:21 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> +static inline int > +__csched_vcpu_should_migrate(int cpu, cpumask_t *mask, cpumask_t *idlers) > +{ > + /* > + * Consent to migration if cpu is one of the idlers in the VCPU''s > + * affinity mask. In fact, if that is not the case, it just means it > + * was some other CPU that was tickled and should hence come and pick > + * VCPU up. Migrating it to cpu would only make things worse. > + */ > + return cpumask_test_cpu(cpu, idlers) && cpumask_test_cpu(cpu, mask); > } > > static int > @@ -493,85 +599,98 @@ static int > cpumask_t idlers; > cpumask_t *online; > struct csched_pcpu *spc = NULL; > + int ret, balance_step; > int cpu; > > - /* > - * Pick from online CPUs in VCPU''s affinity mask, giving a > - * preference to its current processor if it''s in there. > - */ > online = cpupool_scheduler_cpumask(vc->domain->cpupool); > - cpumask_and(&cpus, online, vc->cpu_affinity); > - cpu = cpumask_test_cpu(vc->processor, &cpus) > - ? vc->processor > - : cpumask_cycle(vc->processor, &cpus); > - ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); > + for_each_csched_balance_step( balance_step ) > + { > + /* Pick an online CPU from the proper affinity mask */ > + ret = csched_balance_cpumask(vc, balance_step, &cpus); > + cpumask_and(&cpus, &cpus, online); > > - /* > - * Try to find an idle processor within the above constraints. > - * > - * In multi-core and multi-threaded CPUs, not all idle execution > - * vehicles are equal! > - * > - * We give preference to the idle execution vehicle with the most > - * idling neighbours in its grouping. This distributes work across > - * distinct cores first and guarantees we don''t do something stupid > - * like run two VCPUs on co-hyperthreads while there are idle cores > - * or sockets. > - * > - * Notice that, when computing the "idleness" of cpu, we may want to > - * discount vc. That is, iff vc is the currently running and the only > - * runnable vcpu on cpu, we add cpu to the idlers. > - */ > - cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); > - if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) > - cpumask_set_cpu(cpu, &idlers); > - cpumask_and(&cpus, &cpus, &idlers); > - cpumask_clear_cpu(cpu, &cpus); > + /* If present, prefer vc''s current processor */ > + cpu = cpumask_test_cpu(vc->processor, &cpus) > + ? vc->processor > + : cpumask_cycle(vc->processor, &cpus); > + ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); > > - while ( !cpumask_empty(&cpus) ) > - { > - cpumask_t cpu_idlers; > - cpumask_t nxt_idlers; > - int nxt, weight_cpu, weight_nxt; > - int migrate_factor; > + /* > + * Try to find an idle processor within the above constraints. > + * > + * In multi-core and multi-threaded CPUs, not all idle execution > + * vehicles are equal! > + * > + * We give preference to the idle execution vehicle with the most > + * idling neighbours in its grouping. This distributes work across > + * distinct cores first and guarantees we don''t do something stupid > + * like run two VCPUs on co-hyperthreads while there are idle cores > + * or sockets. > + * > + * Notice that, when computing the "idleness" of cpu, we may want to > + * discount vc. That is, iff vc is the currently running and the only > + * runnable vcpu on cpu, we add cpu to the idlers. 
> + */ > + cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); > + if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) > + cpumask_set_cpu(cpu, &idlers); > + cpumask_and(&cpus, &cpus, &idlers); > + /* If there are idlers and cpu is still not among them, pick one */ > + if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) ) > + cpu = cpumask_cycle(cpu, &cpus);This seems to be an addition to the algorithm -- particularly hidden in this kind of "indent a big section that''s almost exactly the same", I think this at least needs to be called out in the changelog message, perhaps put in a separate patch. Can you comment on to why you think it''s necessary? Was there a particular problem you were seeing?> + cpumask_clear_cpu(cpu, &cpus); > > - nxt = cpumask_cycle(cpu, &cpus); > + while ( !cpumask_empty(&cpus) ) > + { > + cpumask_t cpu_idlers; > + cpumask_t nxt_idlers; > + int nxt, weight_cpu, weight_nxt; > + int migrate_factor; > > - if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > - { > - /* We''re on the same socket, so check the busy-ness of threads. > - * Migrate if # of idlers is less at all */ > - ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 1; > - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, cpu)); > - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, nxt)); > - } > - else > - { > - /* We''re on different sockets, so check the busy-ness of cores. > - * Migrate only if the other core is twice as idle */ > - ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > - migrate_factor = 2; > - cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu)); > - cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); > + nxt = cpumask_cycle(cpu, &cpus); > + > + if ( cpumask_test_cpu(cpu, per_cpu(cpu_core_mask, nxt)) ) > + { > + /* We''re on the same socket, so check the busy-ness of threads. > + * Migrate if # of idlers is less at all */ > + ASSERT( cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 1; > + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_sibling_mask, > + cpu)); > + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_sibling_mask, > + nxt)); > + } > + else > + { > + /* We''re on different sockets, so check the busy-ness of cores. > + * Migrate only if the other core is twice as idle */ > + ASSERT( !cpumask_test_cpu(nxt, per_cpu(cpu_core_mask, cpu)) ); > + migrate_factor = 2; > + cpumask_and(&cpu_idlers, &idlers, per_cpu(cpu_core_mask, cpu)); > + cpumask_and(&nxt_idlers, &idlers, per_cpu(cpu_core_mask, nxt)); > + } > + > + weight_cpu = cpumask_weight(&cpu_idlers); > + weight_nxt = cpumask_weight(&nxt_idlers); > + /* smt_power_savings: consolidate work rather than spreading it */ > + if ( sched_smt_power_savings ? > + weight_cpu > weight_nxt : > + weight_cpu * migrate_factor < weight_nxt ) > + { > + cpumask_and(&nxt_idlers, &cpus, &nxt_idlers); > + spc = CSCHED_PCPU(nxt); > + cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers); > + cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu)); > + } > + else > + { > + cpumask_andnot(&cpus, &cpus, &nxt_idlers); > + } > } > > - weight_cpu = cpumask_weight(&cpu_idlers); > - weight_nxt = cpumask_weight(&nxt_idlers); > - /* smt_power_savings: consolidate work rather than spreading it */ > - if ( sched_smt_power_savings ? 
> - weight_cpu > weight_nxt : > - weight_cpu * migrate_factor < weight_nxt ) > - { > - cpumask_and(&nxt_idlers, &cpus, &nxt_idlers); > - spc = CSCHED_PCPU(nxt); > - cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers); > - cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu)); > - } > - else > - { > - cpumask_andnot(&cpus, &cpus, &nxt_idlers); > - } > + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */ > + if ( cpumask_test_cpu(cpu, &idlers) || ret ) > + break;Right -- OK, I think everything looks good here, except the "return -1 from csched_balance_cpumask" thing. I think it would be better if we explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped the NODE step if that''s the case. Also -- and sorry to have to ask this kind of thing, but after sorting through the placement algorithm my head hurts -- under what circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this point? It seems like the only possibility would be if: ( (vc->processor was not in the original &cpus [1]) || !IS_RUNQ_IDLE(vc->processor) ) && (there are no idlers in the original &cpus) Which I suppose probably matches the time when we want to move on from looking at NODE affinity and look for CPU affinity. [1] This could happen either if the vcpu/node affinity has changed, or if we''re currently running outside our node affinity and we''re doing the NODE step. OK -- I think I''ve convinced myself that this is OK as well (apart from the hidden check). I''ll come back to look at your response to the load balancing thing tomorrow. -George
Dario Faggioli
2012-Dec-21 00:18 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote:> > - /* > > - * Pick from online CPUs in VCPU''s affinity mask, giving a > > - * preference to its current processor if it''s in there. > > - */ > > online = cpupool_scheduler_cpumask(vc->domain->cpupool); > > - cpumask_and(&cpus, online, vc->cpu_affinity); > > - cpu = cpumask_test_cpu(vc->processor, &cpus) > > - ? vc->processor > > - : cpumask_cycle(vc->processor, &cpus); > > - ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); > > + for_each_csched_balance_step( balance_step ) > > + { > > + /* Pick an online CPU from the proper affinity mask */ > > + ret = csched_balance_cpumask(vc, balance_step, &cpus); > > + cpumask_and(&cpus, &cpus, online); > > > > - /* > > - * Try to find an idle processor within the above constraints. > > - * > > - * In multi-core and multi-threaded CPUs, not all idle execution > > - * vehicles are equal! > > - * > > - * We give preference to the idle execution vehicle with the most > > - * idling neighbours in its grouping. This distributes work across > > - * distinct cores first and guarantees we don''t do something stupid > > - * like run two VCPUs on co-hyperthreads while there are idle cores > > - * or sockets. > > - * > > - * Notice that, when computing the "idleness" of cpu, we may want to > > - * discount vc. That is, iff vc is the currently running and the only > > - * runnable vcpu on cpu, we add cpu to the idlers. > > - */ > > - cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); > > - if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) > > - cpumask_set_cpu(cpu, &idlers); > > - cpumask_and(&cpus, &cpus, &idlers); > > - cpumask_clear_cpu(cpu, &cpus); > > + /* If present, prefer vc''s current processor */ > > + cpu = cpumask_test_cpu(vc->processor, &cpus) > > + ? vc->processor > > + : cpumask_cycle(vc->processor, &cpus); > > + ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); > > > > - while ( !cpumask_empty(&cpus) ) > > - { > > - cpumask_t cpu_idlers; > > - cpumask_t nxt_idlers; > > - int nxt, weight_cpu, weight_nxt; > > - int migrate_factor; > > + /* > > + * Try to find an idle processor within the above constraints. > > + * > > + * In multi-core and multi-threaded CPUs, not all idle execution > > + * vehicles are equal! > > + * > > + * We give preference to the idle execution vehicle with the most > > + * idling neighbours in its grouping. This distributes work across > > + * distinct cores first and guarantees we don''t do something stupid > > + * like run two VCPUs on co-hyperthreads while there are idle cores > > + * or sockets. > > + * > > + * Notice that, when computing the "idleness" of cpu, we may want to > > + * discount vc. That is, iff vc is the currently running and the only > > + * runnable vcpu on cpu, we add cpu to the idlers. > > + */ > > + cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); > > + if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) > > + cpumask_set_cpu(cpu, &idlers); > > + cpumask_and(&cpus, &cpus, &idlers); > > + /* If there are idlers and cpu is still not among them, pick one */ > > + if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) ) > > + cpu = cpumask_cycle(cpu, &cpus); > > This seems to be an addition to the algorithm -- particularly hidden in > this kind of "indent a big section that''s almost exactly the same", I > think this at least needs to be called out in the changelog message, > perhaps put in a separate patch. 
>You''re right, it is an addition, although a minor enough one (at least from the amount of code point of view, the effect of not having it was pretty bad! :-P) that I thought it can "hide" here. :-) But I guess I can put it in a separate patch.> Can you comment on to why you think it''s necessary? Was there a > particular problem you were seeing? >Yep. Suppose vc is for some reason running on a pcpu which is outside its node-affinity, but that now some pcpus within vc''s node-affinity have become idle. What we would like is vc start running there as soon as possible, so we expect this call to _csched_pick_cpu() to determine that. What happens is we do not use vc->processor (as it is outside of vc''s node-affinity) and ''cpu'' gets set to cpumask_cycle(vc->processor, &cpus), where ''cpu'' is the cpumask_and(&cpus, balance_mask, online). This means ''cpu''. Let''s also suppose that ''cpu'' now points to a busy thread but with an idle sibling, and that there aren''t any other idle pcpus (either core or threads). Now, the algorithm evaluates the idleness of ''cpu'', and compares it with the idleness of all the other pcpus, and it won''t find anything better that ''cpu'' itself, as all the other pcpus except its sibling thread are busy, while its sibling thread has the very same idleness it has (2 threads, 1 idle 1 busy). The neat effect is vc being moved to ''cpu'', which is busy, while it could have been moved to ''cpu''''s sibling thread, which is indeed idle. The if() I added fixes this by making sure that the reference cpu is an idle one (if that is possible). I hope I''ve explained it correctly, and sorry if it is a little bit tricky, especially to explain like this (although, believe me, it was tricky to hunt it out too! :-P). I''ve seen that happening and I''m almost sure I kept a trace somewhere, so let me know if you want to see the "smoking gun". :-)> > - weight_cpu = cpumask_weight(&cpu_idlers); > > - weight_nxt = cpumask_weight(&nxt_idlers); > > - /* smt_power_savings: consolidate work rather than spreading it */ > > - if ( sched_smt_power_savings ? > > - weight_cpu > weight_nxt : > > - weight_cpu * migrate_factor < weight_nxt ) > > - { > > - cpumask_and(&nxt_idlers, &cpus, &nxt_idlers); > > - spc = CSCHED_PCPU(nxt); > > - cpu = cpumask_cycle(spc->idle_bias, &nxt_idlers); > > - cpumask_andnot(&cpus, &cpus, per_cpu(cpu_sibling_mask, cpu)); > > - } > > - else > > - { > > - cpumask_andnot(&cpus, &cpus, &nxt_idlers); > > - } > > + /* Stop if cpu is idle (or if csched_balance_cpumask() says we can) */ > > + if ( cpumask_test_cpu(cpu, &idlers) || ret ) > > + break; > > Right -- OK, I think everything looks good here, except the "return -1 > from csched_balance_cpumask" thing. I think it would be better if we > explicitly checked cpumask_full(...->node_affinity_cpumask) and skipped > the NODE step if that''s the case. >Yep. Will do this, or something along this line, all over the place. Thanks.> Also -- and sorry to have to ask this kind of thing, but after sorting > through the placement algorithm my head hurts -- under what > circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this > point? It seems like the only possibility would be if: > ( (vc->processor was not in the original &cpus [1]) > || !IS_RUNQ_IDLE(vc->processor) ) > && (there are no idlers in the original &cpus) > > Which I suppose probably matches the time when we want to move on from > looking at NODE affinity and look for CPU affinity. 
> > [1] This could happen either if the vcpu/node affinity has changed, or > if we're currently running outside our node affinity and we're doing the > NODE step. > > OK -- I think I've convinced myself that this is OK as well (apart from > the hidden check). I'll come back to look at your response to the load > balancing thing tomorrow.

Mmm... Sorry, not sure I follow, does this mean that you figured out and understood why I need that 'if(){break;}'? Sounds like so, but I can't be sure (my head hurts a bit too, after having written that thing! :-D).

If no, consider that, for example, it can be false if all the pcpus in the mask for this step are busy. If this step is the node-affinity step, I do _not_ want to exit the balancing loop in that case; I want to go on and check the other pcpus in the vcpu-affinity. OTOH, if I don't put a break somewhere, even if an idle pcpu is found during the node-affinity balancing step, it will just go on and check all the other pcpus in the vcpu-affinity, which would defeat the purpose of having tried to do the balancing here. Makes sense?

Thanks again and Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
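To keep the discussion readable, here is a condensed (and heavily simplified) sketch of the per-step logic being debated, with the new if() and the final break called out; it paraphrases the quoted patch rather than reproducing it:

    for_each_csched_balance_step( balance_step )
    {
        /* cpus = (this step's affinity mask) & online; ret flags whether
         * the step degenerates to plain vcpu-affinity. */
        ret = csched_balance_cpumask(vc, balance_step, &cpus);
        cpumask_and(&cpus, &cpus, online);

        cpu = cpumask_test_cpu(vc->processor, &cpus)
                  ? vc->processor : cpumask_cycle(vc->processor, &cpus);

        /* idlers = online pcpus the scheduler currently considers idle */
        cpumask_and(&cpus, &cpus, &idlers);

        /* New if(): if cpu itself is busy but the mask does contain an
         * idler, make an idler the reference cpu, so the idleness-weight
         * comparison below cannot settle on a busy (hyper)thread. */
        if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) )
            cpu = cpumask_cycle(cpu, &cpus);

        /* ... existing socket/thread idleness comparison ... */

        /* Final break: if this step found an idle pcpu (or there is no
         * narrower node-affinity to honour), stop here; otherwise fall
         * through and retry on the wider vcpu-affinity mask. */
        if ( cpumask_test_cpu(cpu, &idlers) || ret )
            break;
    }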
George Dunlap
2012-Dec-21 14:29 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 20/12/12 18:18, Dario Faggioli wrote:> On Thu, Dec 20, 2012 at 5:48 PM, George Dunlap > <george.dunlap@eu.citrix.com> wrote: >> And in any case, looking at the caller of csched_load_balance(), it >> explicitly says to steal work if the next thing on the runqueue of cpu has a >> priority of TS_OVER. That was chosen for a reason -- if you want to change >> that, you should change it there at the top (and make a justification for >> doing so), not deeply nested in a function like this. >> >> Or am I completely missing something? >> > No, you''re right. Trying to solve a nasty issue I was seeing, I overlooked I was > changing the underlying logic until that point... Thanks! > > What I want to avoid is the following: a vcpu wakes-up on the busy pcpu Y. As > a consequence, the idle pcpu X is tickled. Then, for any unrelated reason, pcpu > Z reschedules and, as it would go idle too, it looks around for any > vcpu to steal, > finds one in Y''s runqueue and grabs it. Afterward, when X gets the IPI and > schedules, it just does not find anyone to run and goes back idling. > > Now, suppose the vcpu has X, but *not* Z, in its node-affinity (while > it has a full > vcpu-affinity, i.e., can run everywhere). In this case, a vcpu that > could have run on > a pcpu in its node-affinity, executes outside from it. That happens because, > the NODE_BALANCE_STEP in csched_load_balance(), when called by Z, won''t > find anything suitable to steal (provided there actually isn''t any > vcpu waiting in > any runqueue with node-affinity with Z), while the CPU_BALANCE_STEP will > find our vcpu. :-( > > So, what I wanted is something that could tell me whether the pcpu which is > stealing work is the one that has actually been tickled to do so. I > was then using > the pcpu idleness as a (cheap and easy to check) indication of that, > but I now see > this is having side effects I in the first place did not want to cause. > > Sorry for that, I probably spent so much time buried, as you where > saying, in the > various nested loops and calls, that I lost the context a little bit! :-POK, that makes sense -- I figured it was something like that. Don''t feel too bad about missing that connection -- we''re all fairly blind to our own code, and I only caught it because I was trying to figure out what was going on. That''s why we do patch review. :-) Honestly, the whole "steal work" idea seemed a bit backwards to begin with, but now that we''re not just dealing with "possible" and "not possible", but with "better" and "worse", the work-stealing method of load balancing sort of falls down. It does make sense to do the load-balancing work on idle cpus rather than already-busy cpus; but I wonder if what should happen instead is that before idling, a pcpu chooses a "busy" pcpu and does a global load balancing for it -- i.e., pcpu 1 will look at pcpu 5''s runqueue, and consider moving away the vcpus on the runqueue not just to itself but to any available cpu. That way, in your example, Z might wake up, look at X''s runqueue, and say, "This would probably run well on Y -- I''ll migrate it there." But that''s kind of a half-baked idea at this point.> Ok, I think the problem I was describing is real, and I''ve seen it happening and > causing performances degradation. However, as I think a good solution > is going to > be more complex than I thought, I''d better repost without this > function and deal with > it in a future separate patch (after having figured out the best way > of doing so). 
Is > that fine with you?

Yes, that's fine.  Thanks, Dario.

-George
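For what it is worth, a purely illustrative sketch of the kind of check being talked about; the per-scheduler "tickled" mask below does not exist in the credit code, it is only meant to make the idea concrete:

    /* When __runq_tickle() kicks a set of pcpus to come and pick a vcpu up,
     * remember who was kicked (hypothetical per-scheduler mask): */
    cpumask_or(prv->tickled, prv->tickled, &mask);

    /* A pcpu entering csched_load_balance() could then tell whether it is
     * stealing because it was explicitly asked to, or just because it
     * happened to go idle, and be more conservative in the latter case: */
    bool was_tickled = cpumask_test_cpu(cpu, prv->tickled);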
George Dunlap
2012-Dec-21 14:56 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On 21/12/12 00:18, Dario Faggioli wrote:> On Thu, 2012-12-20 at 20:21 +0000, George Dunlap wrote: >>> - /* >>> - * Pick from online CPUs in VCPU''s affinity mask, giving a >>> - * preference to its current processor if it''s in there. >>> - */ >>> online = cpupool_scheduler_cpumask(vc->domain->cpupool); >>> - cpumask_and(&cpus, online, vc->cpu_affinity); >>> - cpu = cpumask_test_cpu(vc->processor, &cpus) >>> - ? vc->processor >>> - : cpumask_cycle(vc->processor, &cpus); >>> - ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); >>> + for_each_csched_balance_step( balance_step ) >>> + { >>> + /* Pick an online CPU from the proper affinity mask */ >>> + ret = csched_balance_cpumask(vc, balance_step, &cpus); >>> + cpumask_and(&cpus, &cpus, online); >>> >>> - /* >>> - * Try to find an idle processor within the above constraints. >>> - * >>> - * In multi-core and multi-threaded CPUs, not all idle execution >>> - * vehicles are equal! >>> - * >>> - * We give preference to the idle execution vehicle with the most >>> - * idling neighbours in its grouping. This distributes work across >>> - * distinct cores first and guarantees we don''t do something stupid >>> - * like run two VCPUs on co-hyperthreads while there are idle cores >>> - * or sockets. >>> - * >>> - * Notice that, when computing the "idleness" of cpu, we may want to >>> - * discount vc. That is, iff vc is the currently running and the only >>> - * runnable vcpu on cpu, we add cpu to the idlers. >>> - */ >>> - cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); >>> - if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) >>> - cpumask_set_cpu(cpu, &idlers); >>> - cpumask_and(&cpus, &cpus, &idlers); >>> - cpumask_clear_cpu(cpu, &cpus); >>> + /* If present, prefer vc''s current processor */ >>> + cpu = cpumask_test_cpu(vc->processor, &cpus) >>> + ? vc->processor >>> + : cpumask_cycle(vc->processor, &cpus); >>> + ASSERT( !cpumask_empty(&cpus) && cpumask_test_cpu(cpu, &cpus) ); >>> >>> - while ( !cpumask_empty(&cpus) ) >>> - { >>> - cpumask_t cpu_idlers; >>> - cpumask_t nxt_idlers; >>> - int nxt, weight_cpu, weight_nxt; >>> - int migrate_factor; >>> + /* >>> + * Try to find an idle processor within the above constraints. >>> + * >>> + * In multi-core and multi-threaded CPUs, not all idle execution >>> + * vehicles are equal! >>> + * >>> + * We give preference to the idle execution vehicle with the most >>> + * idling neighbours in its grouping. This distributes work across >>> + * distinct cores first and guarantees we don''t do something stupid >>> + * like run two VCPUs on co-hyperthreads while there are idle cores >>> + * or sockets. >>> + * >>> + * Notice that, when computing the "idleness" of cpu, we may want to >>> + * discount vc. That is, iff vc is the currently running and the only >>> + * runnable vcpu on cpu, we add cpu to the idlers. >>> + */ >>> + cpumask_and(&idlers, &cpu_online_map, CSCHED_PRIV(ops)->idlers); >>> + if ( vc->processor == cpu && IS_RUNQ_IDLE(cpu) ) >>> + cpumask_set_cpu(cpu, &idlers); >>> + cpumask_and(&cpus, &cpus, &idlers); >>> + /* If there are idlers and cpu is still not among them, pick one */ >>> + if ( !cpumask_empty(&cpus) && !cpumask_test_cpu(cpu, &cpus) ) >>> + cpu = cpumask_cycle(cpu, &cpus); >> This seems to be an addition to the algorithm -- particularly hidden in >> this kind of "indent a big section that''s almost exactly the same", I >> think this at least needs to be called out in the changelog message, >> perhaps put in a separate patch. 
>> > You''re right, it is an addition, although a minor enough one (at least > from the amount of code point of view, the effect of not having it was > pretty bad! :-P) that I thought it can "hide" here. :-) > > But I guess I can put it in a separate patch. > >> Can you comment on to why you think it''s necessary? Was there a >> particular problem you were seeing? >> > Yep. Suppose vc is for some reason running on a pcpu which is outside > its node-affinity, but that now some pcpus within vc''s node-affinity > have become idle. What we would like is vc start running there as soon > as possible, so we expect this call to _csched_pick_cpu() to determine > that. > > What happens is we do not use vc->processor (as it is outside of vc''s > node-affinity) and ''cpu'' gets set to cpumask_cycle(vc->processor, > &cpus), where ''cpu'' is the cpumask_and(&cpus, balance_mask, online). > This means ''cpu''. Let''s also suppose that ''cpu'' now points to a busy > thread but with an idle sibling, and that there aren''t any other idle > pcpus (either core or threads). Now, the algorithm evaluates the > idleness of ''cpu'', and compares it with the idleness of all the other > pcpus, and it won''t find anything better that ''cpu'' itself, as all the > other pcpus except its sibling thread are busy, while its sibling thread > has the very same idleness it has (2 threads, 1 idle 1 busy). > > The neat effect is vc being moved to ''cpu'', which is busy, while it > could have been moved to ''cpu''''s sibling thread, which is indeed idle. > > The if() I added fixes this by making sure that the reference cpu is an > idle one (if that is possible). > > I hope I''ve explained it correctly, and sorry if it is a little bit > tricky, especially to explain like this (although, believe me, it was > tricky to hunt it out too! :-P). I''ve seen that happening and I''m almost > sure I kept a trace somewhere, so let me know if you want to see the > "smoking gun". :-)No, the change looks quite plausible. I guess it''s not obvious that the balancing code will never migrate from one thread to another thread. (That whole algorithm could do with some commenting -- I may submit a patch once this series is in.) I''m really glad you''ve had the opportunity to take a close look at these kinds of things.>> Also -- and sorry to have to ask this kind of thing, but after sorting >> through the placement algorithm my head hurts -- under what >> circumstances would "cpumask_test_cpu(cpu, &idlers)" be false at this >> point? It seems like the only possibility would be if: >> ( (vc->processor was not in the original &cpus [1]) >> || !IS_RUNQ_IDLE(vc->processor) ) >> && (there are no idlers in the original &cpus) >> >> Which I suppose probably matches the time when we want to move on from >> looking at NODE affinity and look for CPU affinity. >> >> [1] This could happen either if the vcpu/node affinity has changed, or >> if we''re currently running outside our node affinity and we''re doing the >> NODE step. >> >> OK -- I think I''ve convinced myself that this is OK as well (apart from >> the hidden check). I''ll come back to look at your response to the load >> balancing thing tomorrow. >> > Mmm... Sorry, not sure I follow, does this means that you figured out > and understood why I need that ''if(){break;}'' ? Sounds like so, but I > can''t be sure (my head hurts a bit too, after having written that > thing! :-D).Well I always understood why we needed the break -- the purpose is to avoid the second run through when it''s not necessary. 
What I was doing, in a sort of "thinking out loud" fashion, was seeing under what conditions that break might actually happen. Like the analysis with vcpu_should_migrate(), it might have turned out to be redundant, or to have missed some cases. But I think it doesn't, so it's fine. :-)

-George
George Dunlap
2012-Dec-21 15:17 UTC
Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> Make it possible to pass the node-affinity of a domain to the hypervisor > from the upper layers, instead of always being computed automatically. > > Note that this also required generalizing the Flask hooks for setting > and getting the affinity, so that they now deal with both vcpu and > node affinity. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>I can''t comment on the XSM stuff -- is any part of the "getvcpuaffinity" stuff for XSM a public interface that needs to be backwards-compatible? I.e., is s/vcpu//; OK from an interface point of view? WRT everything else: Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> --- > Changes from v1: > * added the missing dummy hook for nodeaffinity; > * let the permission renaming affect flask policies too. > > diff --git a/tools/flask/policy/policy/flask/access_vectors b/tools/flask/policy/policy/flask/access_vectors > --- a/tools/flask/policy/policy/flask/access_vectors > +++ b/tools/flask/policy/policy/flask/access_vectors > @@ -47,8 +47,8 @@ class domain > transition > max_vcpus > destroy > - setvcpuaffinity > - getvcpuaffinity > + setaffinity > + getaffinity > scheduler > getdomaininfo > getvcpuinfo > diff --git a/tools/flask/policy/policy/mls b/tools/flask/policy/policy/mls > --- a/tools/flask/policy/policy/mls > +++ b/tools/flask/policy/policy/mls > @@ -70,11 +70,11 @@ mlsconstrain domain transition > (( h1 dom h2 ) and (( l1 eq l2 ) or (t1 == mls_priv))); > > # all the domain "read" ops > -mlsconstrain domain { getvcpuaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext } > +mlsconstrain domain { getaffinity getdomaininfo getvcpuinfo getvcpucontext getaddrsize getextvcpucontext } > ((l1 dom l2) or (t1 == mls_priv)); > > # all the domain "write" ops > -mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setvcpuaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext } > +mlsconstrain domain { setvcpucontext pause unpause resume create max_vcpus destroy setaffinity scheduler setdomainmaxmem setdomainhandle setdebugging hypercall settime set_target shutdown setaddrsize trigger setextvcpucontext } > ((l1 eq l2) or (t1 == mls_priv)); > > # This is incomplete - similar constraints must be written for all classes > diff --git a/tools/flask/policy/policy/modules/xen/xen.if b/tools/flask/policy/policy/modules/xen/xen.if > --- a/tools/flask/policy/policy/modules/xen/xen.if > +++ b/tools/flask/policy/policy/modules/xen/xen.if > @@ -55,9 +55,9 @@ define(`create_domain_build_label'', ` > # manage_domain(priv, target) > # Allow managing a running domain > define(`manage_domain'', ` > - allow $1 $2:domain { getdomaininfo getvcpuinfo getvcpuaffinity > + allow $1 $2:domain { getdomaininfo getvcpuinfo getaffinity > getaddrsize pause unpause trigger shutdown destroy > - setvcpuaffinity setdomainmaxmem }; > + setaffinity setdomainmaxmem }; > '') > > # migrate_domain_out(priv, target) > diff --git a/xen/common/domain.c b/xen/common/domain.c > --- a/xen/common/domain.c > +++ b/xen/common/domain.c > @@ -222,6 +222,7 @@ struct domain *domain_create( > > spin_lock_init(&d->node_affinity_lock); > d->node_affinity = NODE_MASK_ALL; > + d->auto_node_affinity = 1; > > spin_lock_init(&d->shutdown_lock); > d->shutdown_code = -1; > @@ -362,11 +363,26 @@ void domain_update_node_affinity(struct > cpumask_or(cpumask, cpumask, online_affinity); > } > > - for_each_online_node ( 
node ) > - if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) > - node_set(node, nodemask); > + if ( d->auto_node_affinity ) > + { > + /* Node-affinity is automaically computed from all vcpu-affinities */ > + for_each_online_node ( node ) > + if ( cpumask_intersects(&node_to_cpumask(node), cpumask) ) > + node_set(node, nodemask); > > - d->node_affinity = nodemask; > + d->node_affinity = nodemask; > + } > + else > + { > + /* Node-affinity is provided by someone else, just filter out cpus > + * that are either offline or not in the affinity of any vcpus. */ > + for_each_node_mask ( node, d->node_affinity ) > + if ( !cpumask_intersects(&node_to_cpumask(node), cpumask) ) > + node_clear(node, d->node_affinity); > + } > + > + sched_set_node_affinity(d, &d->node_affinity); > + > spin_unlock(&d->node_affinity_lock); > > free_cpumask_var(online_affinity); > @@ -374,6 +390,36 @@ void domain_update_node_affinity(struct > } > > > +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity) > +{ > + /* Being affine with no nodes is just wrong */ > + if ( nodes_empty(*affinity) ) > + return -EINVAL; > + > + spin_lock(&d->node_affinity_lock); > + > + /* > + * Being/becoming explicitly affine to all nodes is not particularly > + * useful. Let''s take it as the `reset node affinity` command. > + */ > + if ( nodes_full(*affinity) ) > + { > + d->auto_node_affinity = 1; > + goto out; > + } > + > + d->auto_node_affinity = 0; > + d->node_affinity = *affinity; > + > +out: > + spin_unlock(&d->node_affinity_lock); > + > + domain_update_node_affinity(d); > + > + return 0; > +} > + > + > struct domain *get_domain_by_id(domid_t dom) > { > struct domain *d; > diff --git a/xen/common/domctl.c b/xen/common/domctl.c > --- a/xen/common/domctl.c > +++ b/xen/common/domctl.c > @@ -609,6 +609,40 @@ long do_domctl(XEN_GUEST_HANDLE_PARAM(xe > } > break; > > + case XEN_DOMCTL_setnodeaffinity: > + case XEN_DOMCTL_getnodeaffinity: > + { > + domid_t dom = op->domain; > + struct domain *d = rcu_lock_domain_by_id(dom); > + > + ret = -ESRCH; > + if ( d == NULL ) > + break; > + > + ret = xsm_nodeaffinity(op->cmd, d); > + if ( ret ) > + goto nodeaffinity_out; > + > + if ( op->cmd == XEN_DOMCTL_setnodeaffinity ) > + { > + nodemask_t new_affinity; > + > + ret = xenctl_bitmap_to_nodemask(&new_affinity, > + &op->u.nodeaffinity.nodemap); > + if ( !ret ) > + ret = domain_set_node_affinity(d, &new_affinity); > + } > + else > + { > + ret = nodemask_to_xenctl_bitmap(&op->u.nodeaffinity.nodemap, > + &d->node_affinity); > + } > + > + nodeaffinity_out: > + rcu_unlock_domain(d); > + } > + break; > + > case XEN_DOMCTL_setvcpuaffinity: > case XEN_DOMCTL_getvcpuaffinity: > { > diff --git a/xen/common/keyhandler.c b/xen/common/keyhandler.c > --- a/xen/common/keyhandler.c > +++ b/xen/common/keyhandler.c > @@ -217,6 +217,14 @@ static void cpuset_print(char *set, int > *set++ = ''\0''; > } > > +static void nodeset_print(char *set, int size, const nodemask_t *mask) > +{ > + *set++ = ''[''; > + set += nodelist_scnprintf(set, size-2, mask); > + *set++ = '']''; > + *set++ = ''\0''; > +} > + > static void periodic_timer_print(char *str, int size, uint64_t period) > { > if ( period == 0 ) > @@ -272,6 +280,9 @@ static void dump_domains(unsigned char k > > dump_pageframe_info(d); > > + nodeset_print(tmpstr, sizeof(tmpstr), &d->node_affinity); > + printk("NODE affinity for domain %d: %s\n", d->domain_id, tmpstr); > + > printk("VCPU information and callbacks for domain %u:\n", > d->domain_id); > for_each_vcpu ( d, v ) > diff --git 
a/xen/common/sched_credit.c b/xen/common/sched_credit.c > --- a/xen/common/sched_credit.c > +++ b/xen/common/sched_credit.c > @@ -269,6 +269,33 @@ static inline void > list_del_init(&svc->runq_elem); > } > > +/* > + * Translates node-affinity mask into a cpumask, so that we can use it during > + * actual scheduling. That of course will contain all the cpus from all the > + * set nodes in the original node-affinity mask. > + * > + * Note that any serialization needed to access mask safely is complete > + * responsibility of the caller of this function/hook. > + */ > +static void csched_set_node_affinity( > + const struct scheduler *ops, > + struct domain *d, > + nodemask_t *mask) > +{ > + struct csched_dom *sdom; > + int node; > + > + /* Skip idle domain since it doesn''t even have a node_affinity_cpumask */ > + if ( unlikely(is_idle_domain(d)) ) > + return; > + > + sdom = CSCHED_DOM(d); > + cpumask_clear(sdom->node_affinity_cpumask); > + for_each_node_mask( node, *mask ) > + cpumask_or(sdom->node_affinity_cpumask, sdom->node_affinity_cpumask, > + &node_to_cpumask(node)); > +} > + > #define for_each_csched_balance_step(__step) \ > for ( (__step) = CSCHED_BALANCE_LAST; (__step) >= 0; (__step)-- ) > > @@ -296,7 +323,8 @@ csched_balance_cpumask(const struct vcpu > > cpumask_and(mask, sdom->node_affinity_cpumask, vc->cpu_affinity); > > - if ( cpumask_full(sdom->node_affinity_cpumask) ) > + if ( cpumask_full(sdom->node_affinity_cpumask) || > + d->auto_node_affinity == 1 ) > return -1; > } > else /* step == CSCHED_BALANCE_CPU_AFFINITY */ > @@ -1896,6 +1924,8 @@ const struct scheduler sched_credit_def > .adjust = csched_dom_cntl, > .adjust_global = csched_sys_cntl, > > + .set_node_affinity = csched_set_node_affinity, > + > .pick_cpu = csched_cpu_pick, > .do_schedule = csched_schedule, > > diff --git a/xen/common/schedule.c b/xen/common/schedule.c > --- a/xen/common/schedule.c > +++ b/xen/common/schedule.c > @@ -590,6 +590,11 @@ int cpu_disable_scheduler(unsigned int c > return ret; > } > > +void sched_set_node_affinity(struct domain *d, nodemask_t *mask) > +{ > + SCHED_OP(DOM2OP(d), set_node_affinity, d, mask); > +} > + > int vcpu_set_affinity(struct vcpu *v, const cpumask_t *affinity) > { > cpumask_t online_affinity; > diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h > --- a/xen/include/public/domctl.h > +++ b/xen/include/public/domctl.h > @@ -279,6 +279,16 @@ typedef struct xen_domctl_getvcpuinfo xe > DEFINE_XEN_GUEST_HANDLE(xen_domctl_getvcpuinfo_t); > > > +/* Get/set the NUMA node(s) with which the guest has affinity with. */ > +/* XEN_DOMCTL_setnodeaffinity */ > +/* XEN_DOMCTL_getnodeaffinity */ > +struct xen_domctl_nodeaffinity { > + struct xenctl_bitmap nodemap;/* IN */ > +}; > +typedef struct xen_domctl_nodeaffinity xen_domctl_nodeaffinity_t; > +DEFINE_XEN_GUEST_HANDLE(xen_domctl_nodeaffinity_t); > + > + > /* Get/set which physical cpus a vcpu can execute on. 
*/ > /* XEN_DOMCTL_setvcpuaffinity */ > /* XEN_DOMCTL_getvcpuaffinity */ > @@ -907,6 +917,8 @@ struct xen_domctl { > #define XEN_DOMCTL_audit_p2m 65 > #define XEN_DOMCTL_set_virq_handler 66 > #define XEN_DOMCTL_set_broken_page_p2m 67 > +#define XEN_DOMCTL_setnodeaffinity 68 > +#define XEN_DOMCTL_getnodeaffinity 69 > #define XEN_DOMCTL_gdbsx_guestmemio 1000 > #define XEN_DOMCTL_gdbsx_pausevcpu 1001 > #define XEN_DOMCTL_gdbsx_unpausevcpu 1002 > @@ -920,6 +932,7 @@ struct xen_domctl { > struct xen_domctl_getpageframeinfo getpageframeinfo; > struct xen_domctl_getpageframeinfo2 getpageframeinfo2; > struct xen_domctl_getpageframeinfo3 getpageframeinfo3; > + struct xen_domctl_nodeaffinity nodeaffinity; > struct xen_domctl_vcpuaffinity vcpuaffinity; > struct xen_domctl_shadow_op shadow_op; > struct xen_domctl_max_mem max_mem; > diff --git a/xen/include/xen/nodemask.h b/xen/include/xen/nodemask.h > --- a/xen/include/xen/nodemask.h > +++ b/xen/include/xen/nodemask.h > @@ -8,8 +8,9 @@ > * See detailed comments in the file linux/bitmap.h describing the > * data type on which these nodemasks are based. > * > - * For details of nodemask_scnprintf() and nodemask_parse(), > - * see bitmap_scnprintf() and bitmap_parse() in lib/bitmap.c. > + * For details of nodemask_scnprintf(), nodelist_scnpintf() and > + * nodemask_parse(), see bitmap_scnprintf() and bitmap_parse() > + * in lib/bitmap.c. > * > * The available nodemask operations are: > * > @@ -50,6 +51,7 @@ > * unsigned long *nodes_addr(mask) Array of unsigned long''s in mask > * > * int nodemask_scnprintf(buf, len, mask) Format nodemask for printing > + * int nodelist_scnprintf(buf, len, mask) Format nodemask as a list for printing > * int nodemask_parse(ubuf, ulen, mask) Parse ascii string as nodemask > * > * for_each_node_mask(node, mask) for-loop node over mask > @@ -292,6 +294,14 @@ static inline int __cycle_node(int n, co > > #define nodes_addr(src) ((src).bits) > > +#define nodelist_scnprintf(buf, len, src) \ > + __nodelist_scnprintf((buf), (len), (src), MAX_NUMNODES) > +static inline int __nodelist_scnprintf(char *buf, int len, > + const nodemask_t *srcp, int nbits) > +{ > + return bitmap_scnlistprintf(buf, len, srcp->bits, nbits); > +} > + > #if 0 > #define nodemask_scnprintf(buf, len, src) \ > __nodemask_scnprintf((buf), (len), &(src), MAX_NUMNODES) > diff --git a/xen/include/xen/sched-if.h b/xen/include/xen/sched-if.h > --- a/xen/include/xen/sched-if.h > +++ b/xen/include/xen/sched-if.h > @@ -184,6 +184,8 @@ struct scheduler { > struct xen_domctl_scheduler_op *); > int (*adjust_global) (const struct scheduler *, > struct xen_sysctl_scheduler_op *); > + void (*set_node_affinity) (const struct scheduler *, > + struct domain *, nodemask_t *); > void (*dump_settings) (const struct scheduler *); > void (*dump_cpu_state) (const struct scheduler *, int); > > diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h > --- a/xen/include/xen/sched.h > +++ b/xen/include/xen/sched.h > @@ -359,8 +359,12 @@ struct domain > /* Various mem_events */ > struct mem_event_per_domain *mem_event; > > - /* Currently computed from union of all vcpu cpu-affinity masks. */ > + /* > + * Can be specified by the user. If that is not the case, it is > + * computed from the union of all the vcpu cpu-affinity masks. 
> + */ > nodemask_t node_affinity; > + int auto_node_affinity; > unsigned int last_alloc_node; > spinlock_t node_affinity_lock; > }; > @@ -429,6 +433,7 @@ static inline void get_knownalive_domain > ASSERT(!(atomic_read(&d->refcnt) & DOMAIN_DESTROYED)); > } > > +int domain_set_node_affinity(struct domain *d, const nodemask_t *affinity); > void domain_update_node_affinity(struct domain *d); > > struct domain *domain_create( > @@ -543,6 +548,7 @@ void sched_destroy_domain(struct domain > int sched_move_domain(struct domain *d, struct cpupool *c); > long sched_adjust(struct domain *, struct xen_domctl_scheduler_op *); > long sched_adjust_global(struct xen_sysctl_scheduler_op *); > +void sched_set_node_affinity(struct domain *, nodemask_t *); > int sched_id(void); > void sched_tick_suspend(void); > void sched_tick_resume(void); > diff --git a/xen/include/xsm/xsm.h b/xen/include/xsm/xsm.h > --- a/xen/include/xsm/xsm.h > +++ b/xen/include/xsm/xsm.h > @@ -56,6 +56,7 @@ struct xsm_operations { > int (*domain_create) (struct domain *d, u32 ssidref); > int (*max_vcpus) (struct domain *d); > int (*destroydomain) (struct domain *d); > + int (*nodeaffinity) (int cmd, struct domain *d); > int (*vcpuaffinity) (int cmd, struct domain *d); > int (*scheduler) (struct domain *d); > int (*getdomaininfo) (struct domain *d); > @@ -229,6 +230,11 @@ static inline int xsm_destroydomain (str > return xsm_call(destroydomain(d)); > } > > +static inline int xsm_nodeaffinity (int cmd, struct domain *d) > +{ > + return xsm_call(nodeaffinity(cmd, d)); > +} > + > static inline int xsm_vcpuaffinity (int cmd, struct domain *d) > { > return xsm_call(vcpuaffinity(cmd, d)); > diff --git a/xen/xsm/dummy.c b/xen/xsm/dummy.c > --- a/xen/xsm/dummy.c > +++ b/xen/xsm/dummy.c > @@ -54,6 +54,11 @@ static int dummy_destroydomain (struct d > return 0; > } > > +static int dummy_nodeaffinity (int cmd, struct domain *d) > +{ > + return 0; > +} > + > static int dummy_vcpuaffinity (int cmd, struct domain *d) > { > return 0; > @@ -634,6 +639,7 @@ void xsm_fixup_ops (struct xsm_operation > set_to_dummy_if_null(ops, domain_create); > set_to_dummy_if_null(ops, max_vcpus); > set_to_dummy_if_null(ops, destroydomain); > + set_to_dummy_if_null(ops, nodeaffinity); > set_to_dummy_if_null(ops, vcpuaffinity); > set_to_dummy_if_null(ops, scheduler); > set_to_dummy_if_null(ops, getdomaininfo); > diff --git a/xen/xsm/flask/hooks.c b/xen/xsm/flask/hooks.c > --- a/xen/xsm/flask/hooks.c > +++ b/xen/xsm/flask/hooks.c > @@ -521,17 +521,19 @@ static int flask_destroydomain(struct do > DOMAIN__DESTROY); > } > > -static int flask_vcpuaffinity(int cmd, struct domain *d) > +static int flask_affinity(int cmd, struct domain *d) > { > u32 perm; > > switch ( cmd ) > { > case XEN_DOMCTL_setvcpuaffinity: > - perm = DOMAIN__SETVCPUAFFINITY; > + case XEN_DOMCTL_setnodeaffinity: > + perm = DOMAIN__SETAFFINITY; > break; > case XEN_DOMCTL_getvcpuaffinity: > - perm = DOMAIN__GETVCPUAFFINITY; > + case XEN_DOMCTL_getnodeaffinity: > + perm = DOMAIN__GETAFFINITY; > break; > default: > return -EPERM; > @@ -1473,7 +1475,8 @@ static struct xsm_operations flask_ops > .domain_create = flask_domain_create, > .max_vcpus = flask_max_vcpus, > .destroydomain = flask_destroydomain, > - .vcpuaffinity = flask_vcpuaffinity, > + .nodeaffinity = flask_affinity, > + .vcpuaffinity = flask_affinity, > .scheduler = flask_scheduler, > .getdomaininfo = flask_getdomaininfo, > .getvcpucontext = flask_getvcpucontext, > diff --git a/xen/xsm/flask/include/av_perm_to_string.h 
b/xen/xsm/flask/include/av_perm_to_string.h > --- a/xen/xsm/flask/include/av_perm_to_string.h > +++ b/xen/xsm/flask/include/av_perm_to_string.h > @@ -37,8 +37,8 @@ > S_(SECCLASS_DOMAIN, DOMAIN__TRANSITION, "transition") > S_(SECCLASS_DOMAIN, DOMAIN__MAX_VCPUS, "max_vcpus") > S_(SECCLASS_DOMAIN, DOMAIN__DESTROY, "destroy") > - S_(SECCLASS_DOMAIN, DOMAIN__SETVCPUAFFINITY, "setvcpuaffinity") > - S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUAFFINITY, "getvcpuaffinity") > + S_(SECCLASS_DOMAIN, DOMAIN__SETAFFINITY, "setaffinity") > + S_(SECCLASS_DOMAIN, DOMAIN__GETAFFINITY, "getaffinity") > S_(SECCLASS_DOMAIN, DOMAIN__SCHEDULER, "scheduler") > S_(SECCLASS_DOMAIN, DOMAIN__GETDOMAININFO, "getdomaininfo") > S_(SECCLASS_DOMAIN, DOMAIN__GETVCPUINFO, "getvcpuinfo") > diff --git a/xen/xsm/flask/include/av_permissions.h b/xen/xsm/flask/include/av_permissions.h > --- a/xen/xsm/flask/include/av_permissions.h > +++ b/xen/xsm/flask/include/av_permissions.h > @@ -38,8 +38,8 @@ > #define DOMAIN__TRANSITION 0x00000020UL > #define DOMAIN__MAX_VCPUS 0x00000040UL > #define DOMAIN__DESTROY 0x00000080UL > -#define DOMAIN__SETVCPUAFFINITY 0x00000100UL > -#define DOMAIN__GETVCPUAFFINITY 0x00000200UL > +#define DOMAIN__SETAFFINITY 0x00000100UL > +#define DOMAIN__GETAFFINITY 0x00000200UL > #define DOMAIN__SCHEDULER 0x00000400UL > #define DOMAIN__GETDOMAININFO 0x00000800UL > #define DOMAIN__GETVCPUINFO 0x00001000UL
George Dunlap
2012-Dec-21 15:19 UTC
Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> By providing the proper get/set interface and wiring them > to the new domctl-s from the previous commit. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>I haven''t done a detailed review, but everything looks OK: Acked-by: George Dunlap <george.dunlap@eu.citrix.com>> > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c > --- a/tools/libxc/xc_domain.c > +++ b/tools/libxc/xc_domain.c > @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch > } > > > +int xc_domain_node_setaffinity(xc_interface *xch, > + uint32_t domid, > + xc_nodemap_t nodemap) > +{ > + DECLARE_DOMCTL; > + DECLARE_HYPERCALL_BUFFER(uint8_t, local); > + int ret = -1; > + int nodesize; > + > + nodesize = xc_get_nodemap_size(xch); > + if (!nodesize) > + { > + PERROR("Could not get number of nodes"); > + goto out; > + } > + > + local = xc_hypercall_buffer_alloc(xch, local, nodesize); > + if ( local == NULL ) > + { > + PERROR("Could not allocate memory for setnodeaffinity domctl hypercall"); > + goto out; > + } > + > + domctl.cmd = XEN_DOMCTL_setnodeaffinity; > + domctl.domain = (domid_t)domid; > + > + memcpy(local, nodemap, nodesize); > + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); > + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; > + > + ret = do_domctl(xch, &domctl); > + > + xc_hypercall_buffer_free(xch, local); > + > + out: > + return ret; > +} > + > +int xc_domain_node_getaffinity(xc_interface *xch, > + uint32_t domid, > + xc_nodemap_t nodemap) > +{ > + DECLARE_DOMCTL; > + DECLARE_HYPERCALL_BUFFER(uint8_t, local); > + int ret = -1; > + int nodesize; > + > + nodesize = xc_get_nodemap_size(xch); > + if (!nodesize) > + { > + PERROR("Could not get number of nodes"); > + goto out; > + } > + > + local = xc_hypercall_buffer_alloc(xch, local, nodesize); > + if ( local == NULL ) > + { > + PERROR("Could not allocate memory for getnodeaffinity domctl hypercall"); > + goto out; > + } > + > + domctl.cmd = XEN_DOMCTL_getnodeaffinity; > + domctl.domain = (domid_t)domid; > + > + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); > + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; > + > + ret = do_domctl(xch, &domctl); > + > + memcpy(nodemap, local, nodesize); > + > + xc_hypercall_buffer_free(xch, local); > + > + out: > + return ret; > +} > + > int xc_vcpu_setaffinity(xc_interface *xch, > uint32_t domid, > int vcpu, > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h > --- a/tools/libxc/xenctrl.h > +++ b/tools/libxc/xenctrl.h > @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch, > uint32_t id, > uint32_t timeout); > > +/** > + * This function explicitly sets the host NUMA nodes the domain will > + * have affinity with. > + * > + * @parm xch a handle to an open hypervisor interface. > + * @parm domid the domain id one wants to set the affinity of. > + * @parm nodemap the map of the affine nodes. > + * @return 0 on success, -1 on failure. > + */ > +int xc_domain_node_setaffinity(xc_interface *xch, > + uint32_t domind, > + xc_nodemap_t nodemap); > + > +/** > + * This function retrieves the host NUMA nodes the domain has > + * affinity with. > + * > + * @parm xch a handle to an open hypervisor interface. > + * @parm domid the domain id one wants to get the node affinity of. > + * @parm nodemap the map of the affine nodes. > + * @return 0 on success, -1 on failure. 
> + */ > +int xc_domain_node_getaffinity(xc_interface *xch, > + uint32_t domind, > + xc_nodemap_t nodemap); > + > int xc_vcpu_setaffinity(xc_interface *xch, > uint32_t domid, > int vcpu,
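As a usage sketch (not part of the patch): assuming the xc_nodemap_t helpers introduced earlier in this series, and in particular an xc_nodemap_alloc() allocator, a caller could set a domain's node-affinity to nodes 0 and 2 roughly like this:

    #include <stdlib.h>
    #include <xenctrl.h>

    static int set_nodes_0_and_2(xc_interface *xch, uint32_t domid)
    {
        xc_nodemap_t nodemap = xc_nodemap_alloc(xch); /* assumed helper */
        int rc;

        if ( !nodemap )
            return -1;

        /* xc_nodemap_t is a plain byte-array bitmap: nodes 0 and 2 both
         * live in the first byte. */
        nodemap[0] |= (1 << 0) | (1 << 2);

        rc = xc_domain_node_setaffinity(xch, domid, nodemap);

        free(nodemap);
        return rc;
    }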
George Dunlap
2012-Dec-21 15:30 UTC
Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> By introducing a nodemap in libxl_domain_build_info and > providing the get/set methods to deal with it. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>I think you''ll probably need to add a line like the following: #define LIBXL_HAVE_NODEAFFINITY 1 So that people wanting to build against different versions of the library can behave appropriately. But IanC or IanJ would be the final word on that, I think. -George> > diff --git a/tools/libxl/libxl.c b/tools/libxl/libxl.c > --- a/tools/libxl/libxl.c > +++ b/tools/libxl/libxl.c > @@ -4142,6 +4142,26 @@ int libxl_set_vcpuaffinity_all(libxl_ctx > return rc; > } > > +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, > + libxl_bitmap *nodemap) > +{ > + if (xc_domain_node_setaffinity(ctx->xch, domid, nodemap->map)) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "setting node affinity"); > + return ERROR_FAIL; > + } > + return 0; > +} > + > +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, > + libxl_bitmap *nodemap) > +{ > + if (xc_domain_node_getaffinity(ctx->xch, domid, nodemap->map)) { > + LIBXL__LOG_ERRNO(ctx, LIBXL__LOG_ERROR, "getting node affinity"); > + return ERROR_FAIL; > + } > + return 0; > +} > + > int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap) > { > GC_INIT(ctx); > diff --git a/tools/libxl/libxl.h b/tools/libxl/libxl.h > --- a/tools/libxl/libxl.h > +++ b/tools/libxl/libxl.h > @@ -861,6 +861,10 @@ int libxl_set_vcpuaffinity(libxl_ctx *ct > libxl_bitmap *cpumap); > int libxl_set_vcpuaffinity_all(libxl_ctx *ctx, uint32_t domid, > unsigned int max_vcpus, libxl_bitmap *cpumap); > +int libxl_domain_set_nodeaffinity(libxl_ctx *ctx, uint32_t domid, > + libxl_bitmap *nodemap); > +int libxl_domain_get_nodeaffinity(libxl_ctx *ctx, uint32_t domid, > + libxl_bitmap *nodemap); > int libxl_set_vcpuonline(libxl_ctx *ctx, uint32_t domid, libxl_bitmap *cpumap); > > libxl_scheduler libxl_get_scheduler(libxl_ctx *ctx); > diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c > --- a/tools/libxl/libxl_create.c > +++ b/tools/libxl/libxl_create.c > @@ -184,6 +184,12 @@ int libxl__domain_build_info_setdefault( > > libxl_defbool_setdefault(&b_info->numa_placement, true); > > + if (!b_info->nodemap.size) { > + if (libxl_node_bitmap_alloc(CTX, &b_info->nodemap, 0)) > + return ERROR_FAIL; > + libxl_bitmap_set_any(&b_info->nodemap); > + } > + > if (b_info->max_memkb == LIBXL_MEMKB_DEFAULT) > b_info->max_memkb = 32 * 1024; > if (b_info->target_memkb == LIBXL_MEMKB_DEFAULT) > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -230,6 +230,7 @@ int libxl__build_pre(libxl__gc *gc, uint > if (rc) > return rc; > } > + libxl_domain_set_nodeaffinity(ctx, domid, &info->nodemap); > libxl_set_vcpuaffinity_all(ctx, domid, info->max_vcpus, &info->cpumap); > > xc_domain_setmaxmem(ctx->xch, domid, info->target_memkb + LIBXL_MAXMEM_CONSTANT); > diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl > --- a/tools/libxl/libxl_types.idl > +++ b/tools/libxl/libxl_types.idl > @@ -261,6 +261,7 @@ libxl_domain_build_info = Struct("domain > ("max_vcpus", integer), > ("avail_vcpus", libxl_bitmap), > ("cpumap", libxl_bitmap), > + ("nodemap", libxl_bitmap), > ("numa_placement", libxl_defbool), > ("tsc_mode", libxl_tsc_mode), > ("max_memkb", MemKB),
George Dunlap
2012-Dec-21 16:00 UTC
Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
On 19/12/12 19:07, Dario Faggioli wrote:> For choosing the best NUMA placement candidate, we need to figure out > how many VCPUs are runnable on each of them. That requires going through > all the VCPUs of all the domains and check their affinities. > > With this change, instead of doing the above for each candidate, we > do it once for all, populating an array while counting. This way, when > we later are evaluating candidates, all we need is summing up the right > elements of the array itself. > > This reduces the complexity of the overall algorithm, as it moves a > potentially expensive operation (for_each_vcpu_of_each_domain {}) > outside from the core placement loop, so that it is performed only > once instead of (potentially) tens or hundreds of times. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>You know this code best. :-) I''ve looked it over and just have one minor suggestion:> for (j = 0; j < nr_dom_vcpus; j++) { > + /* For each vcpu of each domain, increment the elements of > + * the array corresponding to the nodes where the vcpu runs */ > + libxl_bitmap_set_none(&vcpu_nodemap); > + libxl_for_each_set_bit(k, vinfo[j].cpumap) { > + int node = tinfo[k].node;I think I might rename "vcpu_nodemap" to something that suggests better how it fits with the algorithm -- for instance, "counted_nodemap" or "nodes_counted" -- something to suggest that this is how we avoid counting the same vcpu on the same node multiple times. -George
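To make the optimization concrete, a simplified sketch (not the actual libxl code) of what the change buys:

    /* One pass over all vcpus of all domains fills this array, as done in
     * nr_vcpus_on_nodes() in the patch: */
    int vcpus_on_node[NR_NODES];   /* NR_NODES: illustrative constant */

    /* Scoring a placement candidate then collapses to a sum over the
     * candidate's nodes (candidate_nodemap: the candidate's libxl_bitmap): */
    int node, candidate_vcpus = 0;
    libxl_for_each_set_bit(node, candidate_nodemap)
        candidate_vcpus += vcpus_on_node[node];

The expensive walk over every vcpu of every domain thus happens once up front, and evaluating each of the (potentially many) candidates costs only a handful of array reads.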
Dario Faggioli
2012-Dec-21 16:07 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Fri, 2012-12-21 at 14:29 +0000, George Dunlap wrote:

> > Sorry for that, I probably spent so much time buried, as you where > > saying, in the > > various nested loops and calls, that I lost the context a little bit! :-P > > OK, that makes sense -- I figured it was something like that. Don't > feel too bad about missing that connection -- we're all fairly blind to > our own code, and I only caught it because I was trying to figure out > what was going on.

Yeah, thanks, and no, I won't let this get me down too much, even if this was quite a big one. After all, that's what we have patch review for!

> That's why we do patch review. :-)

Hehe, I see we agree. :-)

> Honestly, the whole "steal work" idea seemed a bit backwards to begin > with, but now that we're not just dealing with "possible" and "not > possible", but with "better" and "worse", the work-stealing method of > load balancing sort of falls down. > > [snip] > > But that's kind of a half-baked idea at this point.

Yes, this whole work-stealing machinery may need rethinking. However, let's get something sane in for NUMA load balancing ASAP, as we planned, and then we'll see whether/how to rework it with both simplicity and effectiveness in mind.

> > Ok, I think the problem I was describing is real, and I've seen it happening and > > causing performances degradation. However, as I think a good solution > > is going to > > be more complex than I thought, I'd better repost without this > > function and deal with > > it in a future separate patch (after having figured out the best way > > of doing so). Is > > that fine with you? > > Yes, that's fine.

Ok, I'll sort out all your comments and try to post v3 in early January, so that you'll find it in your inbox as soon as you'll be back from vacations! :-)

> Thanks, Dario.

Thanks to you, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Dec-21 16:13 UTC
Re: [PATCH 03 of 10 v2] xen: sched_credit: let the scheduler know about node-affinity
On Fri, 2012-12-21 at 14:56 +0000, George Dunlap wrote:

> > I hope I've explained it correctly, and sorry if it is a little bit > > tricky, especially to explain like this (although, believe me, it was > > tricky to hunt it out too! :-P). I've seen that happening and I'm almost > > sure I kept a trace somewhere, so let me know if you want to see the > > "smoking gun". :-) > > No, the change looks quite plausible. I guess it's not obvious that the > balancing code will never migrate from one thread to another thread.

It was far from obvious to figure out this was happening, yes. :-)

> (That whole algorithm could do with some commenting -- I may submit a > patch once this series is in.)

Nice.

> I'm really glad you've had the opportunity to take a close look at these > kinds of things.

Yeah, well, I'm happy to; scheduling never stops entertaining me, even (or especially) when it requires my brain cells to work out so hard! :-D

> What I was doing, > in a sort of "thinking out loud" fashion, seeing under what conditions > that break might actually happen. Like the analysis with > vcpu_should_migrate(), it might have turned out to be redundant, or to > have missed some cases.

Yep, I agree, that is another aspect of the patch-review model which is really helpful.

Thanks, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Dec-21 16:17 UTC
Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
On Fri, 2012-12-21 at 15:17 +0000, George Dunlap wrote:

> On 19/12/12 19:07, Dario Faggioli wrote: > > Make it possible to pass the node-affinity of a domain to the hypervisor > > from the upper layers, instead of always being computed automatically. > > > > Note that this also required generalizing the Flask hooks for setting > > and getting the affinity, so that they now deal with both vcpu and > > node affinity. > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > I can't comment on the XSM stuff

Right, it's the part I'm weakest on too... Daniel had a go at this patch during v1's review, and he raised a couple of points that I think I addressed; let's see if he thinks there's anything else.

> -- is any part of the "getvcpuaffinity" > stuff for XSM a public interface that needs to be backwards-compatible? > I.e., is s/vcpu//; OK from an interface point of view?

Mmm... Good point, I haven't thought about that. This was here in v1 and Daniel explicitly said he was fine with the renaming instead of adding a new hook, but mostly from the "semantic" point of view; I am not sure whether backward compatibility is an issue... Daniel, what do you think?

> WRT everything else: > Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

Ok. Thanks, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2012-Dec-21 16:18 UTC
Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote:

> On 19/12/12 19:07, Dario Faggioli wrote: > > By introducing a nodemap in libxl_domain_build_info and > > providing the get/set methods to deal with it. > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> > > I think you'll probably need to add a line like the following: > > #define LIBXL_HAVE_NODEAFFINITY 1 > > So that people wanting to build against different versions of the > library can behave appropriately.

I see what you mean.

> But IanC or IanJ would be the final > word on that, I think.

Well, let's see if they want to comment. In any case, it shouldn't be a big deal to add this. Thanks for pointing that out.

Regards, Dario -- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
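For reference, this is how such a guard is typically consumed by applications (a sketch, assuming the macro does end up being called LIBXL_HAVE_NODEAFFINITY, and using the libxl calls from the quoted patches):

    #ifdef LIBXL_HAVE_NODEAFFINITY
        libxl_bitmap nodemap;

        libxl_bitmap_init(&nodemap);
        if ( !libxl_node_bitmap_alloc(ctx, &nodemap, 0) )
        {
            libxl_bitmap_set(&nodemap, 0);   /* e.g., restrict to node 0 */
            libxl_domain_set_nodeaffinity(ctx, domid, &nodemap);
        }
        libxl_bitmap_dispose(&nodemap);
    #endif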
George Dunlap
2012-Dec-21 16:22 UTC
Re: [PATCH 08 of 10 v2] libxl: automatic placement deals with node-affinity
On 19/12/12 19:07, Dario Faggioli wrote:> Which basically means the following two things: > 1) during domain creation, it is the node-affinity of > the domain --rather than the vcpu-affinities of its > VCPUs-- that is affected by automatic placement; > 2) during automatic placement, when counting how many > VCPUs are already "bound" to a placement candidate > (as part of the process of choosing the best > candidate), both vcpu-affinity and node-affinity > are considered. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Acked-by: George Dunlap <george.dunlap@eu.citrix.com> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com>Re-confirming Ack. -George> > diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c > --- a/tools/libxl/libxl_dom.c > +++ b/tools/libxl/libxl_dom.c > @@ -133,13 +133,13 @@ static int numa_place_domain(libxl__gc * > { > int found; > libxl__numa_candidate candidate; > - libxl_bitmap candidate_nodemap; > + libxl_bitmap cpupool_nodemap; > libxl_cpupoolinfo cpupool_info; > int i, cpupool, rc = 0; > uint32_t memkb; > > libxl__numa_candidate_init(&candidate); > - libxl_bitmap_init(&candidate_nodemap); > + libxl_bitmap_init(&cpupool_nodemap); > > /* > * Extract the cpumap from the cpupool the domain belong to. In fact, > @@ -156,7 +156,7 @@ static int numa_place_domain(libxl__gc * > rc = libxl_domain_need_memory(CTX, info, &memkb); > if (rc) > goto out; > - if (libxl_node_bitmap_alloc(CTX, &candidate_nodemap, 0)) { > + if (libxl_node_bitmap_alloc(CTX, &cpupool_nodemap, 0)) { > rc = ERROR_FAIL; > goto out; > } > @@ -174,17 +174,19 @@ static int numa_place_domain(libxl__gc * > if (found == 0) > goto out; > > - /* Map the candidate''s node map to the domain''s info->cpumap */ > - libxl__numa_candidate_get_nodemap(gc, &candidate, &candidate_nodemap); > - rc = libxl_nodemap_to_cpumap(CTX, &candidate_nodemap, &info->cpumap); > + /* Map the candidate''s node map to the domain''s info->nodemap */ > + libxl__numa_candidate_get_nodemap(gc, &candidate, &info->nodemap); > + > + /* Avoid trying to set the affinity to nodes that might be in the > + * candidate''s nodemap but out of our cpupool. */ > + rc = libxl_cpumap_to_nodemap(CTX, &cpupool_info.cpumap, > + &cpupool_nodemap); > if (rc) > goto out; > > - /* Avoid trying to set the affinity to cpus that might be in the > - * nodemap but not in our cpupool. */ > - libxl_for_each_set_bit(i, info->cpumap) { > - if (!libxl_bitmap_test(&cpupool_info.cpumap, i)) > - libxl_bitmap_reset(&info->cpumap, i); > + libxl_for_each_set_bit(i, info->nodemap) { > + if (!libxl_bitmap_test(&cpupool_nodemap, i)) > + libxl_bitmap_reset(&info->nodemap, i); > } > > LOG(DETAIL, "NUMA placement candidate with %d nodes, %d cpus and " > @@ -193,7 +195,7 @@ static int numa_place_domain(libxl__gc * > > out: > libxl__numa_candidate_dispose(&candidate); > - libxl_bitmap_dispose(&candidate_nodemap); > + libxl_bitmap_dispose(&cpupool_nodemap); > libxl_cpupoolinfo_dispose(&cpupool_info); > return rc; > } > @@ -211,10 +213,10 @@ int libxl__build_pre(libxl__gc *gc, uint > /* > * Check if the domain has any CPU affinity. If not, try to build > * up one. In case numa_place_domain() find at least a suitable > - * candidate, it will affect info->cpumap accordingly; if it > + * candidate, it will affect info->nodemap accordingly; if it > * does not, it just leaves it as it is. 
This means (unless > * some weird error manifests) the subsequent call to > - * libxl_set_vcpuaffinity_all() will do the actual placement, > + * libxl_domain_set_nodeaffinity() will do the actual placement, > * whatever that turns out to be. > */ > if (libxl_defbool_val(info->numa_placement)) { > diff --git a/tools/libxl/libxl_numa.c b/tools/libxl/libxl_numa.c > --- a/tools/libxl/libxl_numa.c > +++ b/tools/libxl/libxl_numa.c > @@ -184,7 +184,7 @@ static int nr_vcpus_on_nodes(libxl__gc * > int vcpus_on_node[]) > { > libxl_dominfo *dinfo = NULL; > - libxl_bitmap vcpu_nodemap; > + libxl_bitmap dom_nodemap, vcpu_nodemap; > int nr_doms, nr_cpus; > int i, j, k; > > @@ -197,6 +197,12 @@ static int nr_vcpus_on_nodes(libxl__gc * > return ERROR_FAIL; > } > > + if (libxl_node_bitmap_alloc(CTX, &dom_nodemap, 0) < 0) { > + libxl_bitmap_dispose(&vcpu_nodemap); > + libxl_dominfo_list_free(dinfo, nr_doms); > + return ERROR_FAIL; > + } > + > for (i = 0; i < nr_doms; i++) { > libxl_vcpuinfo *vinfo; > int nr_dom_vcpus; > @@ -205,14 +211,21 @@ static int nr_vcpus_on_nodes(libxl__gc * > if (vinfo == NULL) > continue; > > + /* Retrieve the domain''s node-affinity map */ > + libxl_domain_get_nodeaffinity(CTX, dinfo[i].domid, &dom_nodemap); > + > for (j = 0; j < nr_dom_vcpus; j++) { > - /* For each vcpu of each domain, increment the elements of > - * the array corresponding to the nodes where the vcpu runs */ > + /* > + * For each vcpu of each domain, it must have both vcpu-affinity > + * and node-affinity to (a pcpu belonging to) a certain node to > + * cause an increment in the corresponding element of the array. > + */ > libxl_bitmap_set_none(&vcpu_nodemap); > libxl_for_each_set_bit(k, vinfo[j].cpumap) { > int node = tinfo[k].node; > > if (libxl_bitmap_test(suitable_cpumap, k) && > + libxl_bitmap_test(&dom_nodemap, node) && > !libxl_bitmap_test(&vcpu_nodemap, node)) { > libxl_bitmap_set(&vcpu_nodemap, node); > vcpus_on_node[node]++; > @@ -223,6 +236,7 @@ static int nr_vcpus_on_nodes(libxl__gc * > libxl_vcpuinfo_list_free(vinfo, nr_dom_vcpus); > } > > + libxl_bitmap_dispose(&dom_nodemap); > libxl_bitmap_dispose(&vcpu_nodemap); > libxl_dominfo_list_free(dinfo, nr_doms); > return 0;
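As a standalone restatement of the filtering step in the quoted hunk, the sketch below uses only the libxl bitmap helpers that appear in the diff; the function name and its shape are made up for illustration and are not code from the patch.

#include <libxl.h>
#include <libxl_utils.h>

/* Hypothetical helper: clear from a candidate node map every node that is
 * not covered by the cpupool's cpumap, so the node-affinity we end up
 * setting never points outside the pool the domain lives in. */
static int restrict_nodemap_to_cpupool(libxl_ctx *ctx,
                                       libxl_bitmap *cpupool_cpumap,
                                       libxl_bitmap *nodemap)
{
    libxl_bitmap cpupool_nodemap;
    int i, rc;

    libxl_bitmap_init(&cpupool_nodemap);
    rc = libxl_node_bitmap_alloc(ctx, &cpupool_nodemap, 0);
    if (rc)
        goto out;

    /* Which nodes does the cpupool actually span? */
    rc = libxl_cpumap_to_nodemap(ctx, cpupool_cpumap, &cpupool_nodemap);
    if (rc)
        goto out;

    /* Drop candidate nodes that lie outside the pool. */
    libxl_for_each_set_bit(i, *nodemap)
        if (!libxl_bitmap_test(&cpupool_nodemap, i))
            libxl_bitmap_reset(nodemap, i);

 out:
    libxl_bitmap_dispose(&cpupool_nodemap);
    return rc;
}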
Dario Faggioli
2012-Dec-21 16:23 UTC
Re: [PATCH 07 of 10 v2] libxl: optimize the calculation of how many VCPUs can run on a candidate
On Fri, 2012-12-21 at 16:00 +0000, George Dunlap wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
> > For choosing the best NUMA placement candidate, we need to figure out
> > how many VCPUs are runnable on each of them. That requires going through
> > all the VCPUs of all the domains and checking their affinities.
> >
> > With this change, instead of doing the above for each candidate, we
> > do it once for all, populating an array while counting. This way, when
> > we later are evaluating candidates, all we need is summing up the right
> > elements of the array itself.
> >
> > This reduces the complexity of the overall algorithm, as it moves a
> > potentially expensive operation (for_each_vcpu_of_each_domain {})
> > outside from the core placement loop, so that it is performed only
> > once instead of (potentially) tens or hundreds of times.
> >
> > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> You know this code best. :-)  I've looked it over and just have one
> minor suggestion:
>
Well, I certainly spent quite a bit of time on it, and it still needs some
more. But again, this change only speeds things up at no "functional cost",
so (despite this not being a critical path) I really think it is something
we want.

BTW, thanks for taking a look.

> >           for (j = 0; j < nr_dom_vcpus; j++) {
> > +             /* For each vcpu of each domain, increment the elements of
> > +              * the array corresponding to the nodes where the vcpu runs */
> > +             libxl_bitmap_set_none(&vcpu_nodemap);
> > +             libxl_for_each_set_bit(k, vinfo[j].cpumap) {
> > +                 int node = tinfo[k].node;
>
> I think I might rename "vcpu_nodemap" to something that suggests better
> how it fits with the algorithm -- for instance, "counted_nodemap" or
> "nodes_counted" -- something to suggest that this is how we avoid
> counting the same vcpu on the same node multiple times.
>
Good point, I'll go for something like that.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
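To make the effect of the optimisation concrete, this is roughly what evaluating a candidate reduces to once vcpus_on_node[] has been filled in a single pass. It is an illustrative sketch only, with a made-up helper name, not code from the patch.

#include <libxl.h>
#include <libxl_utils.h>

/* Made-up helper, for illustration: once nr_vcpus_on_nodes() has filled
 * vcpus_on_node[] (one pass over all vcpus of all domains), scoring a
 * placement candidate is just a sum over the nodes it contains, instead
 * of another walk over every vcpu of every domain. */
static int nr_vcpus_on_candidate(libxl_bitmap *candidate_nodemap,
                                 const int vcpus_on_node[])
{
    int node, total = 0;

    libxl_for_each_set_bit(node, *candidate_nodemap)
        total += vcpus_on_node[node];

    return total;
}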
Dario Faggioli
2012-Dec-21 16:27 UTC
Re: [PATCH 05 of 10 v2] libxc: allow for explicitly specifying node-affinity
On Fri, 2012-12-21 at 15:19 +0000, George Dunlap wrote:> On 19/12/12 19:07, Dario Faggioli wrote: > > By providing the proper get/set interface and wiring them > > to the new domctl-s from the previous commit. > > > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> > > I haven''t done a detailed review, but everything looks OK: > > Acked-by: George Dunlap <george.dunlap@eu.citrix.com> >Ok. Also, let me use this e-mail to thank you properly for the thorough, useful and especially quick review! All you said was really useful, I''ll do my best do address the points you raised in a proper and equally quick manner. See you in 2013 (I''m going on vacation today) with v3! :-D Thanks again and Regards, Dario> > > > diff --git a/tools/libxc/xc_domain.c b/tools/libxc/xc_domain.c > > --- a/tools/libxc/xc_domain.c > > +++ b/tools/libxc/xc_domain.c > > @@ -110,6 +110,83 @@ int xc_domain_shutdown(xc_interface *xch > > } > > > > > > +int xc_domain_node_setaffinity(xc_interface *xch, > > + uint32_t domid, > > + xc_nodemap_t nodemap) > > +{ > > + DECLARE_DOMCTL; > > + DECLARE_HYPERCALL_BUFFER(uint8_t, local); > > + int ret = -1; > > + int nodesize; > > + > > + nodesize = xc_get_nodemap_size(xch); > > + if (!nodesize) > > + { > > + PERROR("Could not get number of nodes"); > > + goto out; > > + } > > + > > + local = xc_hypercall_buffer_alloc(xch, local, nodesize); > > + if ( local == NULL ) > > + { > > + PERROR("Could not allocate memory for setnodeaffinity domctl hypercall"); > > + goto out; > > + } > > + > > + domctl.cmd = XEN_DOMCTL_setnodeaffinity; > > + domctl.domain = (domid_t)domid; > > + > > + memcpy(local, nodemap, nodesize); > > + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); > > + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; > > + > > + ret = do_domctl(xch, &domctl); > > + > > + xc_hypercall_buffer_free(xch, local); > > + > > + out: > > + return ret; > > +} > > + > > +int xc_domain_node_getaffinity(xc_interface *xch, > > + uint32_t domid, > > + xc_nodemap_t nodemap) > > +{ > > + DECLARE_DOMCTL; > > + DECLARE_HYPERCALL_BUFFER(uint8_t, local); > > + int ret = -1; > > + int nodesize; > > + > > + nodesize = xc_get_nodemap_size(xch); > > + if (!nodesize) > > + { > > + PERROR("Could not get number of nodes"); > > + goto out; > > + } > > + > > + local = xc_hypercall_buffer_alloc(xch, local, nodesize); > > + if ( local == NULL ) > > + { > > + PERROR("Could not allocate memory for getnodeaffinity domctl hypercall"); > > + goto out; > > + } > > + > > + domctl.cmd = XEN_DOMCTL_getnodeaffinity; > > + domctl.domain = (domid_t)domid; > > + > > + set_xen_guest_handle(domctl.u.nodeaffinity.nodemap.bitmap, local); > > + domctl.u.nodeaffinity.nodemap.nr_elems = nodesize * 8; > > + > > + ret = do_domctl(xch, &domctl); > > + > > + memcpy(nodemap, local, nodesize); > > + > > + xc_hypercall_buffer_free(xch, local); > > + > > + out: > > + return ret; > > +} > > + > > int xc_vcpu_setaffinity(xc_interface *xch, > > uint32_t domid, > > int vcpu, > > diff --git a/tools/libxc/xenctrl.h b/tools/libxc/xenctrl.h > > --- a/tools/libxc/xenctrl.h > > +++ b/tools/libxc/xenctrl.h > > @@ -521,6 +521,32 @@ int xc_watchdog(xc_interface *xch, > > uint32_t id, > > uint32_t timeout); > > > > +/** > > + * This function explicitly sets the host NUMA nodes the domain will > > + * have affinity with. > > + * > > + * @parm xch a handle to an open hypervisor interface. > > + * @parm domid the domain id one wants to set the affinity of. 
> > + * @parm nodemap the map of the affine nodes. > > + * @return 0 on success, -1 on failure. > > + */ > > +int xc_domain_node_setaffinity(xc_interface *xch, > > + uint32_t domind, > > + xc_nodemap_t nodemap); > > + > > +/** > > + * This function retrieves the host NUMA nodes the domain has > > + * affinity with. > > + * > > + * @parm xch a handle to an open hypervisor interface. > > + * @parm domid the domain id one wants to get the node affinity of. > > + * @parm nodemap the map of the affine nodes. > > + * @return 0 on success, -1 on failure. > > + */ > > +int xc_domain_node_getaffinity(xc_interface *xch, > > + uint32_t domind, > > + xc_nodemap_t nodemap); > > + > > int xc_vcpu_setaffinity(xc_interface *xch, > > uint32_t domid, > > int vcpu, > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel-- <<This happens because I choose it to happen!>> (Raistlin Majere) ----------------------------------------------------------------- Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK) _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
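For completeness, here is a minimal usage sketch for the two calls quoted above. It is not part of the series: the xc_nodemap_alloc() helper is assumed to come from patch 2 ("xen, libxc: introduce node maps and masks"), and everything beyond the bare return code is omitted.

#include <stdlib.h>
#include <xenctrl.h>

/* Sketch only: restrict a domain's node-affinity to a single NUMA node.
 * xc_nodemap_t is a plain byte array (one bit per node); xc_nodemap_alloc()
 * is assumed to be the allocator introduced by patch 2 of this series. */
static int set_single_node_affinity(xc_interface *xch, uint32_t domid,
                                    int node)
{
    xc_nodemap_t nodemap = xc_nodemap_alloc(xch);
    int rc;

    if ( nodemap == NULL )
        return -1;

    nodemap[node / 8] |= 1 << (node % 8);       /* only <node> is set */

    rc = xc_domain_node_setaffinity(xch, domid, nodemap);

    free(nodemap);
    return rc;
}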
George Dunlap
2012-Dec-21 16:34 UTC
Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
On 19/12/12 19:07, Dario Faggioli wrote:> Node-affinity is now something that is under (some) control of the > user, so show it upon request as part of the output of `xl list'' > by the `-n'' option. > > Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com> > Acked-by: Juergen Gross <juergen.gross@ts.fujitsu.com> > --- > Changes from v1: > * print_{cpu,node}map() functions added instead of ''state variable''-izing > print_bitmap(). > > diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c > --- a/tools/libxl/xl_cmdimpl.c > +++ b/tools/libxl/xl_cmdimpl.c > @@ -2961,14 +2961,95 @@ out: > } > } > > -static void list_domains(int verbose, int context, const libxl_dominfo *info, int nb_domain) > +/* If map is not full, prints it and returns 0. Returns 1 otherwise. */ > +static int print_bitmap(uint8_t *map, int maplen, FILE *stream) > +{ > + int i; > + uint8_t pmap = 0, bitmask = 0; > + int firstset = 0, state = 0; > + > + for (i = 0; i < maplen; i++) { > + if (i % 8 == 0) { > + pmap = *map++; > + bitmask = 1; > + } else bitmask <<= 1; > + > + switch (state) { > + case 0: > + case 2: > + if ((pmap & bitmask) != 0) { > + firstset = i; > + state++; > + } > + continue; > + case 1: > + case 3: > + if ((pmap & bitmask) == 0) { > + fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); > + if (i - 1 > firstset) > + fprintf(stream, "-%d", i - 1); > + state = 2; > + } > + continue; > + } > + } > + switch (state) { > + case 0: > + fprintf(stream, "none"); > + break; > + case 2: > + break; > + case 1: > + if (firstset == 0) > + return 1; > + case 3: > + fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); > + if (i - 1 > firstset) > + fprintf(stream, "-%d", i - 1); > + break; > + } > + > + return 0; > +}Just checking -- is the print_bitmap() thing pure code motion? If so, would you mind saying that explicitly in the commit message, just to save people time when reading this patch? Other than that, looks OK to me -- I haven''t done a detailed review of the output layout however. 
-George> + > +static void print_cpumap(uint8_t *map, int maplen, FILE *stream) > +{ > + if (print_bitmap(map, maplen, stream)) > + fprintf(stream, "any cpu"); > +} > + > +static void print_nodemap(uint8_t *map, int maplen, FILE *stream) > +{ > + if (print_bitmap(map, maplen, stream)) > + fprintf(stream, "any node"); > +} > + > +static void list_domains(int verbose, int context, int numa, const libxl_dominfo *info, int nb_domain) > { > int i; > static const char shutdown_reason_letters[]= "-rscw"; > + libxl_bitmap nodemap; > + libxl_physinfo physinfo; > + > + libxl_bitmap_init(&nodemap); > + libxl_physinfo_init(&physinfo); > > printf("Name ID Mem VCPUs\tState\tTime(s)"); > if (verbose) printf(" UUID Reason-Code\tSecurity Label"); > if (context && !verbose) printf(" Security Label"); > + if (numa) { > + if (libxl_node_bitmap_alloc(ctx, &nodemap, 0)) { > + fprintf(stderr, "libxl_node_bitmap_alloc_failed.\n"); > + exit(1); > + } > + if (libxl_get_physinfo(ctx, &physinfo) != 0) { > + fprintf(stderr, "libxl_physinfo failed.\n"); > + libxl_bitmap_dispose(&nodemap); > + exit(1); > + } > + > + printf(" NODE Affinity"); > + } > printf("\n"); > for (i = 0; i < nb_domain; i++) { > char *domname; > @@ -3002,14 +3083,23 @@ static void list_domains(int verbose, in > rc = libxl_flask_sid_to_context(ctx, info[i].ssidref, &buf, > &size); > if (rc < 0) > - printf(" -"); > + printf(" -"); > else { > - printf(" %s", buf); > + printf(" %16s", buf); > free(buf); > } > } > + if (numa) { > + libxl_domain_get_nodeaffinity(ctx, info[i].domid, &nodemap); > + > + putchar('' ''); > + print_nodemap(nodemap.map, physinfo.nr_nodes, stdout); > + } > putchar(''\n''); > } > + > + libxl_bitmap_dispose(&nodemap); > + libxl_physinfo_dispose(&physinfo); > } > > static void list_vm(void) > @@ -3890,12 +3980,14 @@ int main_list(int argc, char **argv) > int opt, verbose = 0; > int context = 0; > int details = 0; > + int numa = 0; > int option_index = 0; > static struct option long_options[] = { > {"long", 0, 0, ''l''}, > {"help", 0, 0, ''h''}, > {"verbose", 0, 0, ''v''}, > {"context", 0, 0, ''Z''}, > + {"numa", 0, 0, ''n''}, > {0, 0, 0, 0} > }; > > @@ -3904,7 +3996,7 @@ int main_list(int argc, char **argv) > int nb_domain, rc; > > while (1) { > - opt = getopt_long(argc, argv, "lvhZ", long_options, &option_index); > + opt = getopt_long(argc, argv, "lvhZn", long_options, &option_index); > if (opt == -1) > break; > > @@ -3921,6 +4013,9 @@ int main_list(int argc, char **argv) > case ''Z'': > context = 1; > break; > + case ''n'': > + numa = 1; > + break; > default: > fprintf(stderr, "option `%c'' not supported.\n", optopt); > break; > @@ -3956,7 +4051,7 @@ int main_list(int argc, char **argv) > if (details) > list_domains_details(info, nb_domain); > else > - list_domains(verbose, context, info, nb_domain); > + list_domains(verbose, context, numa, info, nb_domain); > > if (info_free) > libxl_dominfo_list_free(info, nb_domain); > @@ -4228,56 +4323,6 @@ int main_button_press(int argc, char **a > return 0; > } > > -static void print_bitmap(uint8_t *map, int maplen, FILE *stream) > -{ > - int i; > - uint8_t pmap = 0, bitmask = 0; > - int firstset = 0, state = 0; > - > - for (i = 0; i < maplen; i++) { > - if (i % 8 == 0) { > - pmap = *map++; > - bitmask = 1; > - } else bitmask <<= 1; > - > - switch (state) { > - case 0: > - case 2: > - if ((pmap & bitmask) != 0) { > - firstset = i; > - state++; > - } > - continue; > - case 1: > - case 3: > - if ((pmap & bitmask) == 0) { > - fprintf(stream, "%s%d", state > 1 ? 
"," : "", firstset); > - if (i - 1 > firstset) > - fprintf(stream, "-%d", i - 1); > - state = 2; > - } > - continue; > - } > - } > - switch (state) { > - case 0: > - fprintf(stream, "none"); > - break; > - case 2: > - break; > - case 1: > - if (firstset == 0) { > - fprintf(stream, "any cpu"); > - break; > - } > - case 3: > - fprintf(stream, "%s%d", state > 1 ? "," : "", firstset); > - if (i - 1 > firstset) > - fprintf(stream, "-%d", i - 1); > - break; > - } > -} > - > static void print_vcpuinfo(uint32_t tdomid, > const libxl_vcpuinfo *vcpuinfo, > uint32_t nr_cpus) > @@ -4301,7 +4346,7 @@ static void print_vcpuinfo(uint32_t tdom > /* TIM */ > printf("%9.1f ", ((float)vcpuinfo->vcpu_time / 1e9)); > /* CPU AFFINITY */ > - print_bitmap(vcpuinfo->cpumap.map, nr_cpus, stdout); > + print_cpumap(vcpuinfo->cpumap.map, nr_cpus, stdout); > printf("\n"); > } > > diff --git a/tools/libxl/xl_cmdtable.c b/tools/libxl/xl_cmdtable.c > --- a/tools/libxl/xl_cmdtable.c > +++ b/tools/libxl/xl_cmdtable.c > @@ -50,7 +50,8 @@ struct cmd_spec cmd_table[] = { > "[options] [Domain]\n", > "-l, --long Output all VM details\n" > "-v, --verbose Prints out UUIDs and security context\n" > - "-Z, --context Prints out security context" > + "-Z, --context Prints out security context\n" > + "-n, --numa Prints out NUMA node affinity" > }, > { "destroy", > &main_destroy, 0, 1,
Dario Faggioli
2012-Dec-21 16:54 UTC
Re: [PATCH 09 of 10 v2] xl: add node-affinity to the output of `xl list`
On Fri, 2012-12-21 at 16:34 +0000, George Dunlap wrote:
> Just checking -- is the print_bitmap() thing pure code motion?  If so,
> would you mind saying that explicitly in the commit message, just to
> save people time when reading this patch?
>
It's _mostly_ code motion, but I had to hack it a tiny little bit so that
the same function can print both cpu and node bitmaps (basically, if all
the bits were set, the function used to print "any cpu" itself, which did
not fit node maps). But yes, I should have said that in the changelog;
will do in v3.

> Other than that, looks OK to me -- I haven't done a detailed review of
> the output layout however.
>
That's the trickiest part of this patch! :-) For your and others'
convenience, here's how it looks on my testbox (sorry for the extra long
lines, but that's not my fault!):

root@Zhaman:~# xl list
Name        ID   Mem VCPUs      State   Time(s)
Domain-0     0   375    16     r-----     225.1
vm1          1   960     2     -b----      34.0

root@Zhaman:~# xl list -Z
Name        ID   Mem VCPUs      State   Time(s)   Security Label
Domain-0     0   375    16     r-----     225.4                -
vm1          1   960     2     -b----      34.0                -

root@Zhaman:~# xl list -v
Name        ID   Mem VCPUs      State   Time(s)   UUID                                 Reason-Code   Security Label
Domain-0     0   375    16     r-----     226.6   00000000-0000-0000-0000-000000000000           -                -
vm1          1   960     2     -b----      34.2   e36429cc-d2a2-4da7-b21d-b053f725e7a7           -                -

root@Zhaman:~# xl list -n
Name        ID   Mem VCPUs      State   Time(s)   NODE Affinity
Domain-0     0   375    16     r-----     226.8   any node
vm1          1   960     2     -b----      34.2   0

root@Zhaman:~# xl list -nZ
Name        ID   Mem VCPUs      State   Time(s)   Security Label   NODE Affinity
Domain-0     0   375    16     r-----     226.9                -   any node
vm1          1   960     2     -b----      34.2                -   0

root@Zhaman:~# xl list -nv
Name        ID   Mem VCPUs      State   Time(s)   UUID                                 Reason-Code   Security Label   NODE Affinity
Domain-0     0   375    16     r-----     227.0   00000000-0000-0000-0000-000000000000           -                -   any node
vm1          1   960     2     -b----      34.2   e36429cc-d2a2-4da7-b21d-b053f725e7a7           -                -   0

Reasonable enough?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Ian Jackson
2012-Dec-21 17:02 UTC
Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
Dario Faggioli writes ("Re: [Xen-devel] [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity"):
> On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote:
> > I think you'll probably need to add a line like the following:
> >
> > #define LIBXL_HAVE_NODEAFFINITY 1
> >
> > So that people wanting to build against different versions of the
> > library can behave appropriately.
> >
> I see what you mean.

I think this is a good idea.

I'm afraid I won't be able to do proper justice to your series until
the new year.

Ian.
Dario Faggioli
2012-Dec-21 17:09 UTC
Re: [PATCH 06 of 10 v2] libxl: allow for explicitly specifying node-affinity
On Fri, 2012-12-21 at 17:02 +0000, Ian Jackson wrote:
> > On Fri, 2012-12-21 at 15:30 +0000, George Dunlap wrote:
> > > I think you'll probably need to add a line like the following:
> > >
> > > #define LIBXL_HAVE_NODEAFFINITY 1
> > >
> > > So that people wanting to build against different versions of the
> > > library can behave appropriately.
> > >
> > I see what you mean.
>
> I think this is a good idea.
>
Ok then.

> I'm afraid I won't be able to do proper justice to your series until
> the new year.
>
Yep, I know you're away, and that's fine; my bad for not posting it
earlier. :-) I'll be happy to see your comments when you're back.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Daniel De Graaf
2013-Jan-03 16:05 UTC
Re: [PATCH 04 of 10 v2] xen: allow for explicitly specifying node-affinity
On 12/21/2012 10:17 AM, George Dunlap wrote:
> On 19/12/12 19:07, Dario Faggioli wrote:
>> Make it possible to pass the node-affinity of a domain to the hypervisor
>> from the upper layers, instead of always being computed automatically.
>>
>> Note that this also required generalizing the Flask hooks for setting
>> and getting the affinity, so that they now deal with both vcpu and
>> node affinity.
>>
>> Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
>
> I can't comment on the XSM stuff -- is any part of the "getvcpuaffinity"
> stuff for XSM a public interface that needs to be backwards-compatible?
> I.e., is s/vcpu//; OK from an interface point of view?
>
> WRT everything else:
> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>

It is an interface used only by the XSM policy itself, which is already
going to see non-backwards-compatible changes in 4.3 due to the IS_PRIV
reworking and the addition of new hooks. The FLASK policy in Xen has never
supported loading policies that do not exactly match the hypervisor's
access vectors, because the hypervisor policy is still maintained in the
same source tree as the hypervisor. So I would treat this like the
compatibility between libxc/libxl and the hypervisor, rather than aiming
for the level of compatibility that Linux provides for SELinux policies.

A quick grep of xen-unstable finds one instance of getvcpuaffinity in
xen.te that needs to be changed to getaffinity; with that:

Acked-by: Daniel De Graaf <dgdegra@tycho.nsa.gov>

>> ---
>> Changes from v1:
>>  * added the missing dummy hook for nodeaffinity;
>>  * let the permission renaming affect flask policies too.
On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> * [ 2/10] xen, libxc: introduce node maps and masks

I was just about to apply these two when I realised they had a
hypervisor component too. Needs an Ack from Jan and/or Keir then, I
think.

> [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
> [ 4/10] xen: allow for explicitly specifying node-affinity

These are for Keir and/or Jan, I think.

I wasn't sure if it was OK to apply the rest without the above, but
given that I'm now not immediately doing 1+2 either, that's somewhat
moot.

> * [ 5/10] libxc: allow for explicitly specifying node-affinity
> * [ 6/10] libxl: allow for explicitly specifying node-affinity
>   [ 7/10] libxl: optimize the calculation of how many VCPUs can run on a candidate
> * [ 8/10] libxl: automatic placement deals with node-affinity
> * [ 9/10] xl: add node-affinity to the output of `xl list`
>   [10/10] docs: rearrange and update NUMA placement documentation
On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> > * [ 2/10] xen, libxc: introduce node maps and masks
>
> I was just about to apply these two when I realised they had a
> hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> think.
>
Mmm... but there have been comments on at least 2, and I need to address
them in v3 (which I'll repost shortly; sorry, but I've been sidetracked
a bit).

> > [ 3/10] xen: sched_credit: let the scheduler know about node-affinity
> > [ 4/10] xen: allow for explicitly specifying node-affinity
>
> These are for Keir and/or Jan, I think.
>
> I wasn't sure if it was OK to apply the rest without the above.
>
Well, as above, 3 needs reworking so that I can address Juergen's, Jan's
and George's comments. So, if it's fine with you, I'd say wait for v3
before applying anything from this series.

Sorry if I did not make it clear enough that there _will_ be a v3. :-P

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
On Fri, 2013-01-11 at 13:57 +0000, Dario Faggioli wrote:
> On Fri, 2013-01-11 at 12:19 +0000, Ian Campbell wrote:
> > On Wed, 2012-12-19 at 19:07 +0000, Dario Faggioli wrote:
> > > * [ 1/10] xen, libxc: rename xenctl_cpumap to xenctl_bitmap
> > > * [ 2/10] xen, libxc: introduce node maps and masks
> >
> > I was just about to apply these two when I realised they had a
> > hypervisor component too. Needs an Ack from Jan and/or Keir then, I
> > think.
> >
> Mmm... but there have been comments on at least 2, and I need to address
> them in v3 (which I'll repost shortly; sorry, but I've been sidetracked
> a bit).

Ah, I was just going by the two acks. OK, I'll wait.

Ian.