Dario Faggioli
2012-Jul-26 15:46 UTC
[PATCH v8] Some automatic NUMA placement documentation
About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>
---
Changes from v7:
 * avoid referring to 4.2 release as "upcoming".
 * libxl placement disabling key explicitly mentioned.
 * Limit of max 16 NUMA nodes explicitly mentioned.

Changes from v6:
 * text updated to reflect the modified behaviour.
 * Lines wrapped to a smaller column number.

Changes from v5:
 * text updated to reflect the modified behaviour.

Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,112 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA (which stands for Non-Uniform Memory Access) means that the memory
+access times of a program running on a CPU depend on the relative
+distance between that CPU and that memory. In fact, most NUMA systems
+are built in such a way that each processor has its local memory, on
+which it can operate very fast. On the other hand, getting and storing
+data from and on remote memory (that is, memory local to some other
+processor) is considerably more complex and slow. On these machines, a
+NUMA node is usually defined as a set of processor cores (typically a
+physical CPU package) and the memory directly attached to the set of
+cores.
+
+The Xen hypervisor deals with NUMA machines by assigning to each domain
+a "node affinity", i.e., a set of NUMA nodes of the host from which it
+gets its memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start
+running memory-intensive workloads on a shared host. In fact, the cost
+of accessing non node-local memory locations is very high, and the
+performance degradation is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing, with the only differences
+residing in the details of the heuristics adopted for the placement
+(see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify, directly in a
+domain's config file, the pCPUs on which it should be created and
+scheduled. This affects NUMA placement and memory accesses, as the
+hypervisor constructs the node affinity of a VM based on its CPU
+affinity when it is created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won't be able to guarantee the locality of their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools. Again, this could be "The Right Answer" for many needs and
+occasions, but has to be carefully considered and set up by hand.
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries
+to figure out on its own on which node(s) the domain could fit best.
+It is worthwhile noting that optimally fitting a set of VMs on the NUMA
+nodes of a host is an incarnation of the Bin Packing Problem. In fact,
+the various VMs with different memory sizes are the items to be packed,
+and the host nodes are the bins. As this problem is known to be NP-hard,
+we will be using some heuristics.
+
+The first thing to do is find the nodes, or the sets of nodes (from now
+on referred to as 'candidates'), that have enough free memory and enough
+physical CPUs for accommodating the new domain. The idea is to find a
+spot for the domain with at least as much free memory as it is configured
+to have, and as many pCPUs as it has vCPUs. After that, the actual
+decision on which candidate to pick happens according to the following
+heuristics:
+
+ * candidates involving fewer nodes are considered better. In case
+   two (or more) candidates span the same number of nodes,
+ * candidates with a smaller number of vCPUs runnable on them (due
+   to previous placement and/or plain vCPU pinning) are considered
+   better. In case the same number of vCPUs can run on two (or more)
+   candidates,
+ * the candidate with the greatest amount of free memory is
+   considered to be the best one.
+
+Giving preference to candidates with fewer nodes ensures better
+performance for the guest, as it avoids spreading its memory among
+different nodes. Favoring candidates with fewer vCPUs already runnable
+there ensures a good balance of the overall host load. Finally, if more
+than one candidate fulfils these criteria, prioritizing the nodes that
+have the largest amounts of free memory helps keep memory fragmentation
+small, and maximizes the probability of being able to put more domains
+there.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement because that is what libxl does
+by default. No API is provided (yet) for modifying the behaviour of
+the placement algorithm. However, if your program is calling libxl,
+it is possible to set the `numa_placement` build info key to `false`
+(it is `true` by default) with something like the below, to prevent
+any placement from happening:
+
+    libxl_defbool_set(&domain_build_info->numa_placement, false);
+
+Also, if `numa_placement` is set to `true`, the domain must not
+have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have all its bits set, as it is by default), or domain creation
+will fail returning `ERROR_INVAL`.
+
+Besides that, for looking at and/or tweaking the placement algorithm,
+search for "Automatic NUMA placement" in libxl\_internal.h.
+
+Note this may change in future versions of Xen/libxl.
+
+## Limitations ##
+
+Analyzing various possible placement solutions is what makes the
+algorithm flexible and quite effective. However, that also means
+it won't scale well to systems with arbitrary number of nodes.
+For this reason, if machines with more than 16 NUMA nodes will
+ever exist, placement will automatically disables itself when
+running libxl on them.
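To make the three-step candidate ordering described in the document concrete, here is a minimal, self-contained C sketch of it. It is not the actual libxl code (which, as the document says, is reachable by searching for "Automatic NUMA placement" in libxl\_internal.h); the struct and function names below are invented purely for illustration.

    /* Illustrative only: the candidate ordering described above.
     * Struct and function names are made up, not libxl's. */
    #include <stdint.h>
    #include <stdlib.h>

    struct candidate {
        int      nr_nodes;    /* NUMA nodes spanned by the candidate   */
        int      nr_vcpus;    /* vCPUs already runnable on those nodes */
        uint64_t free_memkb;  /* free memory on those nodes (kB)       */
    };

    /* Returns < 0 if a is a better candidate than b, > 0 if b is better. */
    static int candidate_cmp(const void *va, const void *vb)
    {
        const struct candidate *a = va, *b = vb;

        if (a->nr_nodes != b->nr_nodes)     /* 1) fewer nodes are better   */
            return a->nr_nodes - b->nr_nodes;
        if (a->nr_vcpus != b->nr_vcpus)     /* 2) fewer bound vCPUs better */
            return a->nr_vcpus - b->nr_vcpus;
        /* 3) more free memory is better */
        return (a->free_memkb > b->free_memkb) ? -1 :
               (a->free_memkb < b->free_memkb) ?  1 : 0;
    }

    /* Usage: qsort(cands, nr_cands, sizeof(*cands), candidate_cmp);
     * afterwards cands[0] is the preferred placement candidate. */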
Ian Jackson
2012-Jul-26 16:00 UTC
Re: [PATCH v8] Some automatic NUMA placement documentation
Dario Faggioli writes ("[PATCH v8] Some automatic NUMA placement documentation"):
> About rationale, usage and (some small bits of) API.

> +Also, if `numa_placement` is set to `true`, the domain must not
> +have any cpu affinity (i.e., `domain_build_info->cpumap` must
> +have all its bits set, as it is by default), or domain creation
> +will fail returning `ERROR_INVAL`.

This is no longer true, is it ?

And:

> +Analyzing various possible placement solutions is what makes the
> +algorithm flexible and quite effective. However, that also means
> +it won't scale well to systems with arbitrary number of nodes.
> +For this reason, if machines with more than 16 NUMA nodes will
> +ever exist, placement will automatically disables itself when
> +running libxl on them.

This last is very awkwardly phrased. How about:

    For this reason, automatic numa placement is disabled (with
    a warning) if it is requested on a host with more than 16 NUMA
    nodes.

Ian.
Dario Faggioli
2012-Jul-26 16:09 UTC
Re: [PATCH v8] Some automatic NUMA placement documentation
On Thu, 2012-07-26 at 17:00 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v8] Some automatic NUMA placement documentation"):
> > About rationale, usage and (some small bits of) API.
>
> > +Also, if `numa_placement` is set to `true`, the domain must not
> > +have any cpu affinity (i.e., `domain_build_info->cpumap` must
> > +have all its bits set, as it is by default), or domain creation
> > +will fail returning `ERROR_INVAL`.
>
> This is no longer true, is it ?
>
I think it is. Actually, it was you that suggested turning the code into
that... :-P

It's this thing I'm trying to document:

    if (libxl_defbool_val(info->numa_placement)) {
        int rc;

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }

Am I missing something?

> > +Analyzing various possible placement solutions is what makes the
> > +algorithm flexible and quite effective. However, that also means
> > +it won't scale well to systems with arbitrary number of nodes.
> > +For this reason, if machines with more than 16 NUMA nodes will
> > +ever exist, placement will automatically disables itself when
> > +running libxl on them.
>
> This last is very awkwardly phrased. How about:
>
>     For this reason, automatic numa placement is disabled (with
>     a warning) if it is requested on a host with more than 16 NUMA
>     nodes.
>
Fine with it. Will change it.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
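As a caller-side illustration of the check quoted above: a toolstack linking against libxl that wants an explicit vCPU affinity has to disable automatic placement first, otherwise domain creation fails with `ERROR_INVAL`. The sketch below is hedged, not actual xl code: it assumes the `libxl_bitmap` helpers from libxl_utils.h and that `b_info->cpumap` has already been allocated (as the defaults-setting code does); pinning to CPUs 0-3 is purely an illustrative choice.

    #include <stdbool.h>
    #include <libxl.h>
    #include <libxl_utils.h>

    /* Sketch: explicit affinity together with automatic placement would
     * trip the ERROR_INVAL check above, so placement is switched off
     * before the cpumap is restricted. */
    static void pin_without_placement(libxl_domain_build_info *b_info)
    {
        libxl_defbool_set(&b_info->numa_placement, false);

        /* The cpumap is full by default; clear it and pin to CPUs 0-3. */
        libxl_bitmap_set_none(&b_info->cpumap);
        for (int cpu = 0; cpu < 4; cpu++)
            libxl_bitmap_set(&b_info->cpumap, cpu);
    }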
Ian Jackson
2012-Jul-26 16:12 UTC
Re: [PATCH v8] Some automatic NUMA placement documentation
Dario Faggioli writes ("Re: [PATCH v8] Some automatic NUMA placement documentation"):
> I think it is. Actually, it was you that suggested turning the code into
> that... :-P
>
> It's this thing I'm trying to document:
>
>     if (libxl_defbool_val(info->numa_placement)) {
>         int rc;
>
>         if (!libxl_bitmap_is_full(&info->cpumap)) {
>             LOG(ERROR, "Can run NUMA placement only if no vcpu "
>                        "affinity is specified");
>             return ERROR_INVAL;
>         }
>
>         rc = numa_place_domain(gc, domid, info);
>         if (rc)
>             return rc;
>     }
>
> Am I missing something?

No. I'm conflating cpupools and vcpu affinity.

Ian.