thr3ads.net - Xen devel - [PATCH v8] Some automatic NUMA placement documentation [Jul 2012]

Dario Faggioli

2012-Jul-26 15:46 UTC

[PATCH v8] Some automatic NUMA placement documentation

About rationale, usage and (some small bits of) API.

Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
Acked-by: Ian Campbell <ian.campbell@citrix.com>

---
Changes from v7:
 * avoid referring to 4.2 release as "upcoming".
 * libxl placement disabling key explicitly mentioned.
 * Limit of max 16 NUMA nodes explicitly mentioned.

Changes from v6:
 * text updated to reflect the modified behaviour.
 * Lines wrapped to a smaller column number.

Changes from v5:
 * text updated to reflect the modified behaviour.

Changes from v3:
 * typos and rewording of some sentences, as suggested during review.

Changes from v1:
 * API documentation moved close to the actual functions.

diff --git a/docs/misc/xl-numa-placement.markdown b/docs/misc/xl-numa-placement.markdown
new file mode 100644
--- /dev/null
+++ b/docs/misc/xl-numa-placement.markdown
@@ -0,0 +1,112 @@
+# Guest Automatic NUMA Placement in libxl and xl #
+
+## Rationale ##
+
+NUMA (which stands for Non-Uniform Memory Access) means that the memory
+access times of a program running on a CPU depend on the relative
+distance between that CPU and that memory. In fact, most of the NUMA
+systems are built in such a way that each processor has its local memory,
+on which it can operate very fast. On the other hand, getting and storing
+data from and on remote memory (that is, memory local to some other processor)
+is considerably more complex and slower. On these machines, a NUMA node is usually
+defined as a set of processor cores (typically a physical CPU package) and
+the memory directly attached to the set of cores.
+
+The Xen hypervisor deals with NUMA machines by assigning to each domain
+a "node affinity", i.e., a set of NUMA nodes of the host from which they
+get their memory allocated.
+
+NUMA awareness becomes very important as soon as many domains start
+running memory-intensive workloads on a shared host. In fact, the cost
+of accessing non node-local memory locations is very high, and the
+performance degradation is likely to be noticeable.
+
+## Guest Placement in xl ##
+
+If using xl for creating and managing guests, it is very easy to ask for
+either manual or automatic placement of them across the host's NUMA nodes.
+
+Note that xm/xend does the very same thing, the only differences residing
+in the details of the heuristics adopted for the placement (see below).
+
+### Manual Guest Placement with xl ###
+
+Thanks to the "cpus=" option, it is possible to specify where a domain
+should be created and scheduled on, directly in its config file. This
+affects NUMA placement and memory accesses as the hypervisor constructs
+the node affinity of a VM based directly on its CPU affinity when it is
+created.
+
+This is very simple and effective, but requires the user/system
+administrator to explicitly specify affinities for each and every domain,
+or Xen won't be able to guarantee the locality for their memory accesses.
+
+It is also possible to deal with NUMA by partitioning the system using
+cpupools. Again, this could be "The Right Answer" for many needs and
+occasions, but has to be carefully considered and set up by hand.
+
+### Automatic Guest Placement with xl ###
+
+If no "cpus=" option is specified in the config file, libxl tries
+to figure out on its own on which node(s) the domain could fit best.
+It is worthwhile noting that optimally fitting a set of VMs on the NUMA
+nodes of a host is an incarnation of the Bin Packing Problem. In fact,
+the various VMs with different memory sizes are the items to be packed,
+and the host nodes are the bins. As this problem is known to be NP-hard,
+we will be using some heuristics.
+
+The first thing to do is find the nodes or the sets of nodes (from now
+on referred to as 'candidates') that have enough free memory and enough
+physical CPUs for accommodating the new domain. The idea is to find a
+spot for the domain with at least as much free memory as it has configured
+to have, and as many pCPUs as it has vCPUs.  After that, the actual
+decision on which candidate to pick happens according to the following
+heuristics:
+
+  *  candidates involving fewer nodes are considered better. In case
+     two (or more) candidates span the same number of nodes,
+  *  candidates with a smaller number of vCPUs runnable on them (due
+     to previous placement and/or plain vCPU pinning) are considered
+     better. In case the same number of vCPUs can run on two (or more)
+     candidates,
+  *  the candidate with the greatest amount of free memory is
+     considered to be the best one.
+
+Giving preference to candidates with fewer nodes ensures better
+performance for the guest, as it avoids spreading its memory among
+different nodes. Favoring candidates with fewer vCPUs already runnable
+there ensures a good balance of the overall host load. Finally, if more
+candidates fulfil these criteria, prioritizing the nodes that have the
+largest amounts of free memory helps keep the memory fragmentation
+small, and maximizes the probability of being able to put more domains
+there.
+
+## Guest Placement within libxl ##
+
+xl achieves automatic NUMA placement because that is what libxl does
+by default. No API is provided (yet) for modifying the behaviour of
+the placement algorithm. However, if your program is calling libxl,
+it is possible to set the `numa_placement` build info key to `false`
+(it is `true` by default) with something like the below, to prevent
+any placement from happening:
+
+    libxl_defbool_set(&domain_build_info->numa_placement, false);
+
+Also, if `numa_placement` is set to `true`, the domain must not
+have any cpu affinity (i.e., `domain_build_info->cpumap` must
+have all its bits set, as it is by default), or domain creation
+will fail returning `ERROR_INVAL`.
+
+Besides that, for looking at and/or tweaking the placement algorithm,
+search for "Automatic NUMA placement" in libxl\_internal.h.
+
+Note this may change in future versions of Xen/libxl.
+
+## Limitations ##
+
+Analyzing various possible placement solutions is what makes the
+algorithm flexible and quite effective. However, that also means
+it won't scale well to systems with arbitrary number of nodes.
+For this reason, if machines with more than 16 NUMA nodes will
+ever exist, placement will automatically disables itself when
+running libxl on them.
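
For readers following the thread, the three-step ordering described under "Automatic Guest Placement with xl" can be sketched as a plain comparison function. This is only an illustration of the documented ordering, not the code in libxl: `struct candidate`, `candidate_cmp` and `candidate_pick` are hypothetical names introduced here, and the real implementation may weigh the criteria differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of one placement candidate (not a libxl type). */
struct candidate {
    int nr_nodes;        /* number of NUMA nodes the candidate spans */
    int nr_vcpus;        /* vCPUs already runnable on those nodes */
    uint64_t free_memkb; /* free memory on those nodes, in KiB */
};

/*
 * Returns a negative value if a is the better candidate, a positive
 * value if b is better, and 0 if the two are equivalent.
 */
int candidate_cmp(const struct candidate *a, const struct candidate *b)
{
    assert(a && b);

    /* 1. Fewer nodes is better. */
    if (a->nr_nodes != b->nr_nodes)
        return a->nr_nodes < b->nr_nodes ? -1 : 1;

    /* 2. Same node count: fewer vCPUs already runnable is better. */
    if (a->nr_vcpus != b->nr_vcpus)
        return a->nr_vcpus < b->nr_vcpus ? -1 : 1;

    /* 3. Still tied: more free memory is better. */
    if (a->free_memkb != b->free_memkb)
        return a->free_memkb > b->free_memkb ? -1 : 1;

    return 0;
}

/* Returns the index of the best candidate, or -1 if there are none. */
int candidate_pick(const struct candidate *cands, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++) {
        if (best < 0 || candidate_cmp(&cands[i], &cands[best]) < 0)
            best = i;
    }
    return best;
}
```

The sketch assumes every entry has already passed the "enough free memory and enough pCPUs" filter, so the comparison only has to rank candidates that would all fit the domain.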

Ian Jackson

2012-Jul-26 16:00 UTC

Re: [PATCH v8] Some automatic NUMA placement documentation

Dario Faggioli writes ("[PATCH v8] Some automatic NUMA placement documentation"):
> About rationale, usage and (some small bits of) API.
> +Also, if `numa_placement` is set to `true`, the domain must not
> +have any cpu affinity (i.e., `domain_build_info->cpumap` must
> +have all its bits set, as it is by default), or domain creation
> +will fail returning `ERROR_INVAL`.
This is no longer true, is it ?

And:
> +Analyzing various possible placement solutions is what makes the
> +algorithm flexible and quite effective. However, that also means
> +it won't scale well to systems with arbitrary number of nodes.
> +For this reason, if machines with more than 16 NUMA nodes will
> +ever exist, placement will automatically disables itself when
> +running libxl on them.
This last is very awkwardly phrased.  How about:

   For this reason, automatic numa placement is disabled (with
   a warning) if it is requested on a host with more than 16 NUMA
   nodes.

Ian.

Dario Faggioli

2012-Jul-26 16:09 UTC

Re: [PATCH v8] Some automatic NUMA placement documentation

On Thu, 2012-07-26 at 17:00 +0100, Ian Jackson wrote:
> Dario Faggioli writes ("[PATCH v8] Some automatic NUMA placement documentation"):
> > About rationale, usage and (some small bits of) API.
> 
> > +Also, if `numa_placement` is set to `true`, the domain must not
> > +have any cpu affinity (i.e., `domain_build_info->cpumap` must
> > +have all its bits set, as it is by default), or domain creation
> > +will fail returning `ERROR_INVAL`.
> 
> This is no longer true, is it ?
> 
I think it is. Actually, it was you that suggested turning the code into
that... :-P

It's this thing I'm trying to document:

    if (libxl_defbool_val(info->numa_placement)) {
        int rc;

        if (!libxl_bitmap_is_full(&info->cpumap)) {
            LOG(ERROR, "Can run NUMA placement only if no vcpu "
                       "affinity is specified");
            return ERROR_INVAL;
        }

        rc = numa_place_domain(gc, domid, info);
        if (rc)
            return rc;
    }

Am I missing something?
> > +Analyzing various possible placement solutions is what makes the
> > +algorithm flexible and quite effective. However, that also means
> > +it won't scale well to systems with arbitrary number of nodes.
> > +For this reason, if machines with more than 16 NUMA nodes will
> > +ever exist, placement will automatically disables itself when
> > +running libxl on them.
> 
> This last is very awkwardly phrased.  How about:
> 
>    For this reason, automatic numa placement is disabled (with
>    a warning) if it is requested on a host with more than 16 NUMA
>    nodes.
> 
Fine with it. Will change it.

Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

Ian Jackson

2012-Jul-26 16:12 UTC

Re: [PATCH v8] Some automatic NUMA placement documentation

Dario Faggioli writes ("Re: [PATCH v8] Some automatic NUMA placement documentation"):
> I think it is. Actually, it was you that suggested turning the code into
> that... :-P
> 
> It's this thing I'm trying to document:
> 
>     if (libxl_defbool_val(info->numa_placement)) {
>         int rc;
> 
>         if (!libxl_bitmap_is_full(&info->cpumap)) {
>             LOG(ERROR, "Can run NUMA placement only if no vcpu "
>                        "affinity is specified");
>             return ERROR_INVAL;
>         }
> 
>         rc = numa_place_domain(gc, domid, info);
>         if (rc)
>             return rc;
>     }
> 
> Am I missing something?
No.  I'm conflating cpupools and vcpu affinity.

Ian.
