Yechen Li
2013-Aug-16 04:13 UTC
[RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
This patch contains the details about the design, the benefits and the intended usage of NUMA-awareness in the ballooning driver, coming in the form of a markdown document in docs. Given this is an RFC, it has bits explaining what is still missing, that will of course need to disappear from the final version, but that are useful for reviewing the whole series at this stage.

Signed-off-by: Yechen Li <lccycc123@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/misc/numa-aware-ballooning.markdown | 376 +++++++++++++++++++++++++++++++
 1 file changed, 376 insertions(+)
 create mode 100644 docs/misc/numa-aware-ballooning.markdown

diff --git a/docs/misc/numa-aware-ballooning.markdown b/docs/misc/numa-aware-ballooning.markdown
new file mode 100644
index 0000000..3a899f1
--- /dev/null
+++ b/docs/misc/numa-aware-ballooning.markdown
@@ -0,0 +1,376 @@

# NUMA-Aware Ballooning Design and Implementation #

## Rationale ##

The purpose of this document is to describe how the ballooning driver for Linux is made NUMA-aware, why this is important, and its intended usage on a NUMA host.

NUMA stands for Non-Uniform Memory Access, which means that a typical NUMA host has more than one memory controller, and that accessing a specific memory location can have a varying cost, depending on which physical CPU the access comes from. In the presence of such an architecture, it is very important to keep all the memory of a domain on the same physical NUMA node (or on the smallest possible set of physical NUMA nodes), in order to get good and predictable performance.

Ballooning is an effective means of making space when there is the need to create a new domain. However, on a NUMA host, one could want not only to free some memory, but also to free it from a specific node. That could be, for instance, the case for making sure the new domain will fit in just one node.
For more information on NUMA and on how Xen deals with it, see [Xen NUMA Introduction][numa_intro] on the Xen Wiki, and/or [xl-numa-placement.markdown][numa_placement].

## Definitions ##

Some definitions that could come in handy for reading the rest of this document:

* physical topology -- the hardware NUMA characteristics of the host. It typically comprises the number of NUMA nodes, the memory each node has, and how many (and which) physical CPUs belong to each node.
* virtual topology -- similar information to the physical topology, but applying to a guest rather than to the host (hence the "virtual"). It is constructed at guest creation time and comprises the number of virtual NUMA nodes, the amount of guest memory each virtual node is entitled to, and how many (and which) vCPUs belong to each virtual node.

Physical node may be abbreviated as pnode, or p-node, and virtual node is often abbreviated as vnode, or v-node (similarly to what happens with pCPU and vCPU).

### Physical topology and virtual topology ###

It is __not__ __at__ __all__ __required__ that there is any relationship between the physical and virtual topology, exactly as it is not at all required that there is any relationship between physical CPUs and virtual CPUs.

That is to say, on a host with 4 NUMA nodes, each node with 8 GiB RAM and 4 pCPUs, there can be guests with 1 vnode, 2 vnodes or 4 vnodes (it is not clear yet whether it would make sense to allow overbooking; perhaps yes, but only after having warned the user about it). In all cases, vnodes can have arbitrary sizes, with respect to both the amount of memory and the number of vCPUs in each of them.
Similarly, there is neither the requirement nor the guarantee that either:

* the memory of a vnode is actually allocated on a particular pnode, or even that it is allocated on __only__ one pnode;
* the vCPUs belonging to a vnode are actually bound or pinned to the pCPUs of a particular pnode.

Of course, although it is not required, having some kind of relationship in place between the physical and virtual topology would be beneficial for both performance and resource usage optimization. This is really similar to what happens with vCPU to pCPU pinning: it is not required, but if it is possible to do it properly, performance will increase.

In this specific case, the ideal situation would be as follows:

* all the guest memory pages that constitute one vnode, say vnode #1, are backed by frames that, on the host, belong to the same pnode, say pnode #3 (and that should be true for all the vnodes);
* all the guest vCPUs that belong to vnode #1 have, as far as the Xen scheduler is concerned, NUMA node affinity with (some of) the pCPUs belonging to pnode #3 (and that again should hold for all the vnodes).

For that reason, a mechanism for the user to specify the relationship between virtual and physical topologies at guest creation time will be implemented, as well as some automated logic for coming up with a sane default, in case the user does not say anything (as happens with NUMA automatic placement).

## Current status ##

At the time of writing this document, the work to allow a virtual NUMA topology to be specified and properly constructed within a domain is still in progress. Patches will be released soon, both for Xen and Linux, implementing exactly what has been described in the previous section.

NUMA-aware ballooning needs such a feature to be present, or it won't work properly, so the remainder of this document will assume that we have virtual NUMA in place.
We will also assume that we have the following "services" in place, provided by the Xen hypervisor via a suitable hypercall interface, and available to the toolstack and to the ballooning driver.

### nodemask VNODE\_TO\_PNODE(int vnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the way up to the proper toolstack layer or guest kernel), since it is only Xen that knows both the virtual and the physical topologies.

It takes the ID of a vnode, as they are identified within the virtual topology of a guest, and returns a bitmask. The bits set to 1 in it correspond to the pnodes from which, on the host, the memory of the passed vnode comes.

The ideal situation is that such a bitmask has only 1 bit set, since having the memory of a vnode coming from more than one pnode is clearly bad for access locality.

### nodemask PNODE\_TO\_VNODE(int pnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the way up to the proper toolstack layer or guest kernel), since it is only Xen that knows both the virtual and the physical topologies.

It takes the ID of a pnode, as they are identified within the physical topology of the host, and returns a bitmask. The bits set to 1 in it correspond to the vnodes, in the guest, that use memory from the passed pnode.

The ideal situation is that such a bitmask has only 1 bit set, but it is less of a problem (with respect to the case above) if there are more.

## Description of the problem ##

Let us use an example. Let's assume that guest _G_ has a virtual topology with 2 vnodes, and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2, respectively.

Now, the user wants to create a new guest, but the system is under high memory pressure, so he decides to try ballooning _G_ down.
He sees that pnode #2 has the best chances to accommodate all the memory for the new guest, which would be really good for performance, if only he can make space there. _G_ is the only domain eating some memory from pnode #2 but, as said above, not all of its memory comes from there.

So, right now, the user has no way to specify that he wants to balloon down _G_ in such a way that he will get as many free pages as possible from pnode #2, rather than from pnode #0. He can ask _G_ to balloon down, but there is no guarantee about the pnode from which the memory will be freed.

The same applies to the ballooning up case, when the user, for some specific reason, wants to be sure that it is the memory of some (other) specific pnode that will be used.

What is necessary is something allowing the user to specify from what pnode he wants the free pages, and allowing the ballooning driver to make that happen.

## The current situation ##

This section describes the details of how ballooning works right now. It is possible to skip it if you already possess such knowledge.

Ballooning happens inside the guest, with of course some "help" from Xen. The notification that ballooning up or down is necessary reaches the ballooning driver in the guest via xenstore. See [here][xenstore] and [here][xenstore_paths] for more details.

Typically, the user --via xl or via any other high level toolstack-- calls libxl\_set\_memory\_target(), where the new desired memory target for the guest is written in a xenstore key: ~/memory/target. The guest has a xenstore watch on this key, and every time it changes, work inside the guest itself is scheduled (the body is in the balloon\_process() function). It is the responsibility of this component of the ballooning driver to read the key and take proper action.
There are two possible situations:

* new target < current usage -- the ballooning driver maintains a list of free pages (similar to what the OS does for tracking free memory). Despite being actually free, they are hidden from the OS, which can't use them unless they are handed back to it by the ballooning driver. From the OS perspective, the ballooning driver can be seen as some sort of monster, eating some of the guest free memory.
  To steal a free page from the guest OS, and add it to the list of ballooned pages, the driver calls alloc\_page(). From the point of view of the OS, the ballooning monster ate that page, decreasing the amount of free memory (inside the guest).
  At this point, a XENMEM\_decrease\_reservation hypercall is issued, which hands the newly allocated page to Xen so that it can unmap it, effectively turning it into free host memory.
* new target > current usage -- the ballooning driver picks up one page from the list of ballooned pages it maintains and issues a XENMEM\_populate\_physmap hypercall, with the PFN of the chosen page as an argument.
  Xen goes looking for some free memory in the host and, when it finds some, allocates a page and maps it into the guest's memory space, using the PFN it has been provided with.
  At this point, the ballooning driver calls \_\_free\_reserved\_page() to let the guest OS know that the PFN can now be considered a free page (again), which also means removing the page from the ballooned pages list.

Operations like the ones described above go on within the ballooning driver, perhaps one after the other, until the new target is reached.

## NUMA-aware ballooning ##

The new NUMA-aware ballooning logic works as follows.
There is room, in libxl\_set\_memory\_target(), for two more parameters, in addition to the new memory target:

* _pnid_ -- the ID of the pnode on which the user wants to try to get some free memory;
* _nodeexact_ -- a bool specifying whether, in case it is not possible to reach the new ballooning target only with memory from pnode _pnid_, the user is fine with using memory from other pnodes.
  If _nodeexact_ is true, it is possible that the new target is not reached; if it is false, the new target will (probably) be reached, but it is possible that some memory is freed on pnodes other than _pnid_.

To let the ballooning driver know about these new parameters, a new xenstore key exists: ~/memory/target\_nid. So, for a proper NUMA-aware ballooning operation to occur, the user should write the proper values in both keys: ~/memory/target\_nid and ~/memory/target.

So, in NUMA-aware ballooning, ballooning down and up work as follows:

* target < current usage -- first of all, the ballooning driver uses the PNODE\_TO\_VNODE() service (provided by the virtual topology implementation, as explained above) to translate _pnid_ (which it reads from xenstore) to the corresponding set of vnode IDs, say _{vnids}_ (which will be a one-element set, in case there is only one vnode in the guest allocated out of pnode _pnid_). It then uses alloc\_pages\_node(), with one or more elements from _{vnids}_ as nodes, in order to rip the proper amount of ballooned pages out of the guest OS, and hands them to Xen (with the same hypercall as above).
  If not enough pages can be found, and only if _nodeexact_ is false, it starts considering other nodes (by just calling alloc\_page()). If _nodeexact_ is true, it just returns.
* target > current usage -- first of all, the ballooning driver uses the PNODE\_TO\_VNODE() service to translate _pnid_ to the corresponding set of vnode IDs, say _{vnids}_. It then looks for enough ballooned pages, among the ones belonging to the vnodes in _{vnids}_, and asks Xen to map them via XENMEM\_populate\_physmap. While doing the latter, it explicitly tells the hypervisor to allocate the actual host pages on pnode _pnid_.
  If not enough ballooned pages can be found among the vnodes in _{vnids}_, and only if _nodeexact_ is false, it falls back to looking for ballooned pages in other vnodes. For each one it finds, it calls VNODE\_TO\_PNODE(), to see on what pnode it belongs, and then asks Xen to map it, exactly as above.

The biggest difference between current and NUMA-aware ballooning is that the latter needs to keep multiple lists of the ballooned pages in an array, with one element for each virtual node. This way, it is always evident, at any given time, what ballooned pages belong to what vnode.

Regarding the "stealing a page from the OS" part, it is enough to use the Linux function alloc\_pages\_node() in place of alloc\_page().

Finally, from the hypercall point of view, both XENMEM\_decrease\_reservation and XENMEM\_populate\_physmap are already NUMA-aware, so it is just a matter of passing them the proper arguments.

### Usage examples ###

Let us assume we have a 4-vnode guest on an 8-pnode host. The virtual to physical topology mapping is as follows:

* vnode #0 --> pnode #1
* vnode #1 --> pnode #3
* vnode #2 --> pnode #3
* vnode #3 --> pnode #5

Let us also assume that the user has just decided that he wants to change the current memory allocation scheme, by ballooning (up or down) $DOMID.

Here are some usage examples of NUMA-aware ballooning, with a simplified explanation of what happens in the various situations.
#### Ballooning down, example 1 ####

The user wants N free pages from pnode #3, and only from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #3, and gets {1,2}
4. ballooning driver tries to steal N pages from vnode #1 and/or vnode #2 (both are fine) and adds them to the ballooned out pages (i.e., allocates them in the guest and unmaps them in Xen)
5. whether phase 4 succeeds or not, it returns, having potentially freed fewer than N pages on pnode #3

#### Ballooning down, example 2 ####

The user wants N free pages from pnode #1, but if that is impossible, is fine with other pnodes contributing to that:

1. user invokes `xl mem-set -n 1 $DOMID N`
2. libxl writes N to ~/memory/target and {1,false} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #1, and gets {0}
4. ballooning driver tries to steal N pages from vnode #0 and adds them to the ballooned out pages
5. if fewer than N pages are found in phase 4, the ballooning driver tries to steal from any vnode

#### Ballooning up, example 1 ####

The user wants to give N more pages, taking them from pnode #3, and only from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #3, and gets {1,2}
4. ballooning driver locates N ballooned pages belonging to vnode #1 or vnode #2 (both would do)
5. ballooning driver asks Xen to allocate N pages on pnode #3
6. ballooning driver asks Xen to map the new pages on the ballooned pages and hands the result back to the guest OS

NOTICE that, since _nodeexact_ was true, if phase 4 fails to find N pages (say it finds Y<N), then phases 5 and 6 will work with Y pages.

## Limitations ##

This feature represents a performance (and resource usage) optimization.
As is common in these cases, it does its best under certain conditions. In case it is put to work under different conditions, it is possible that its actual beneficial potential is diminished or, at worst, lost completely. Nevertheless, it is required that everything keeps working and that performance at least does not drop below the level it was at without the feature itself.

In this case, the best or worst working conditions have to do with how well the guest domain is placed on the host NUMA nodes, i.e., with how many and which physical NUMA nodes its memory comes from. If the guest has a virtual NUMA topology, they also have to do with the relationship between the virtual and the physical topology.

It is possible to identify three situations:

1. multiple vnodes are backed by host memory from only one pnode (but not vice-versa, i.e., there is no single vnode backed by host memory from two or more pnodes). We can call this scenario many-to-one;
2. each vnode is backed by host memory coming from one specific pnode (all different from each other). We can call this scenario one-to-one;
3. there are vnodes that are backed by host memory coming at the same time from multiple pnodes. We can call this scenario one-to-many.

If a guest does not have a virtual NUMA topology, it can be seen as having only one virtual NUMA node, accommodating all the memory.

Among the three scenarios above, 1 and 2 are fine, meaning that NUMA-aware ballooning will succeed in using host memory from the correct pnodes, guaranteeing improved performance and better resource utilization than the current situation (if there is enough free memory in the proper place, of course).

The third situation (one-to-many) is the only problematic one.
In fact, if we try to get several pages from pnode #1, with the guest vnode #0 having pages on both pnode #1 and pnode #2, the ballooning driver will pick up the pages from vnode #0's pool of ballooned pages, without any chance of knowing whether they actually are backed by host pages from pnode #1 or from pnode #2.

However, this just means that, in this case, NUMA-aware ballooning behaves pretty much like the current (i.e., __non__ NUMA-aware) ballooning, which is certainly undesirable, but still acceptable. On a related note, having the memory of a vnode split among two (or more) pnodes is a non-optimal situation already, at least from a NUMA perspective, so it would be unrealistic to ask the ballooning driver to do anything better than the above.

[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
[numa_placement]: http://xenbits.xen.org/docs/unstable/misc/xl-numa-placement.markdown
[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[xenstore_paths]: http://xenbits.xen.org/docs/unstable/misc/xenstore-paths.markdown

-- 
1.8.1.4
Jan Beulich
2013-Aug-16 09:09 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> +So, in NUMA aware ballooning, ballooning down and up works as follows:
> +
> +* target < current usage -- first of all, the ballooning driver uses the
> +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
> +  as explained above) to translate _pnid_ (that it reads from xenstore) to
> +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will

This looks conceptually wrong: the balloon driver should have no need to know about pNID-s; it should be the tool stack doing the translation prior to writing the xenstore node.

Further, the new xenstore node would presumably better be a mask than a single vNID, since in order to e.g. balloon up another guest already spanning multiple nodes, giving the tool stack a way to ask for memory on any of the spanned nodes.

And finally, coming back to what Tim had already pointed out - doing things the way you propose can cause an imbalance in the ballooned down guest, penalizing it in favor of not penalizing the intended consumer of the recovered memory. Therefore I wonder whether, without any new xenstore node, it wouldn't be better to simply require conforming balloon drivers to balloon out memory evenly across the domain's virtual nodes.

> +The biggest difference between current and NUMA-aware ballooning is that the
> +latter needs to keep multiple lists of the ballooned pages in an array, with
> +one element for each virtual node. This way, it is always evident, at any
> +given time, what ballooned pages belong to what vnode.

That's wrong afaict: ballooned out pages aren't associated with any memory, and hence can't be associated with any vNID. Once they get re-populated, which vNID the memory belongs to is an attribute of the memory coming in, not the control structure that it's to be associated with.
I believe this thinking of yours stems from the fact that in Linux the page control structures are associated with nodes by way of the physical memory map being split into larger pieces, each coming from a particular node. But other OSes don't need to follow this model, and what you propose would also exclude extending the spanned nodes set if memory gets ballooned in that's not associated with any node the domain so far was "knowing" of.

> +Regarding the stealing a page from the OS part, it is enough to use the Linux
> +function alloc_page_node(), in place of alloc\_page().

Such a statement seems to confirm that you're thinking Linux-centric instead of defining a generic model.

Jan
Li Yechen
2013-Aug-16 10:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi Jan,

On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.
>
> Further, the new xenstore node would presumably better be a mask
> than a single vNID, since in order to e.g. balloon up another guest
> already spanning multiple nodes, giving the tool stack a way to ask
> for memory on any of the spanned nodes.

Yeah, you are right. I'm also telling myself that it's not a good idea to let the guest OS know the physical IDs. These two translations between p-nid and v-nid could be put either inside Xen, or inside the balloon driver, which is its current state. Anyway, the interfaces for the guest NUMA topology have not been implemented yet. I'll mark this as a to-do issue and move it into Xen in the future.

> And finally, coming back what Tim had already pointed out - doing
> things the way you propose can cause an imbalance in the
> ballooned down guest, penalizing it in favor of not penalizing the
> intended consumer of the recovered memory. Therefore I wonder
> whether, without any new xenstore node, it wouldn't be better to
> simply require conforming balloon drivers to balloon out memory
> evenly across the domain's virtual nodes.

I should say sorry here, but I don't quite understand the "whether" part. The "new xenstore node" just stores the requirement from the user, so that the balloon driver can read it. It's similar to ~/memory/target. This new node could store either a p-nodeid or a v-nodeid, according to whether the interfaces we talked about above are placed inside Xen, or inside the guest OS. Do you have a better way to pass this requirement to the balloon driver, instead of creating a new xenstore node?
I'd be very happy if you have one, since I don't like the way I have done it (creating a new node) either!

> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
>
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
>
> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.

You are exactly right again; this design is only for the Linux balloon driver. For Linux, the balloon driver can choose which page to balloon in/out, so we can associate the pages with a v-nodeid. For other kinds of architecture, please forgive me that I haven't thought that far...

> > +Regarding the stealing a page from the OS part, it is enough to use the Linux
> > +function alloc_page_node(), in place of alloc\_page().
>
> Such statement seems to confirm that you're thinking Linux centric
> instead of defining a generic model.
>
> Jan

Yes. And thank you again for spending your valuable time reviewing my patch! I hope my answers could solve your questions. If not, please point it out for me!
-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science, Peking University, China

Nothing is impossible because impossible itself says: "I'm possible"
lccycc From PKU

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2013-Aug-16 13:21 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
(re-adding xen-devel to Cc)

>>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> And finally, coming back what Tim had already pointed out - doing
>> things the way you propose can cause an imbalance in the
>> ballooned down guest, penalizing it in favor of not penalizing the
>> intended consumer of the recovered memory. Therefore I wonder
>> whether, without any new xenstore node, it wouldn't be better to
>> simply require conforming balloon drivers to balloon out memory
>> evenly across the domain's virtual nodes.
>
> I should say sorry here, but I don't quite understand the "whether" part.
> The "new xenstore node" just stores the requirement from the user, so that
> the balloon driver can read it. It's similar to ~/memory/target. This new
> node could store either a p-nodeid or a v-nodeid, according to whether the
> interfaces we talked about above are placed inside Xen, or inside the
> guest OS.
> Do you have a better way to pass this requirement to the balloon driver,
> instead of creating a new xenstore node? I'd be very happy if you have one,
> since I don't like the way I have done it (creating a new node) either!

As said - I'd want you to evaluate a model without such a new node, and with instead the requirement placed on the balloon driver to balloon out pages evenly allocated across the guest's virtual nodes.

> You are exactly right again; this design is only for the Linux balloon driver.
> For Linux, the balloon driver can choose which page to balloon in/out, so we
> can associate the pages with a v-nodeid.
> For other kinds of architecture, please forgive me that I haven't thought
> that far...

The abstract model shouldn't take OS implementation details or policies into account; the implementation later of course can (and frequently will need to).

Jan
Li Yechen
2013-Aug-16 14:17 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.

Oh, so you need the experiments' results without this patch? I see. I'll do it and send the results.

> > You are exactly right again; this design is only for the Linux balloon driver.
> > For Linux, the balloon driver can choose which page to balloon in/out, so we
> > can associate the pages with a v-nodeid.
> > For other kinds of architecture, please forgive me that I haven't thought
> > that far...
>
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
>
> Jan

So, you mean that the abstract model should consider that the OS might not be able to allocate pages by virtual node IDs? That's a question... Let me think about it :-)

-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science, Peking University, China

Nothing is impossible because impossible itself says: "I'm possible"
lccycc From PKU
Jan Beulich
2013-Aug-16 14:55 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 16.08.13 at 16:17, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
>
> Oh, so you need the experiment results without this patch?
> I see. I'll do it and send the results.

What experiment?

>> > You are exactly right again, this design is only for the Linux balloon
>> > driver. For Linux, the balloon driver can choose which pages to balloon
>> > in/out, so we can associate the pages with a v-nodeid.
>> > For other kinds of architecture, please forgive me that I haven't
>> > thought that far...
>>
>> The abstract model shouldn't take OS implementation details or
>> policies into account; the implementation later of course can (and
>> frequently will need to).
>
> So, you mean that the abstract model should consider that the OS might
> not be able to allocate pages by virtual node IDs?

No. What I said is that associating ballooned out pages with a
particular vNID seems wrong. If the balloon driver gets back a fresh
page during re-population, it shouldn't depend on having a suitable
vacated page control structure available, but instead should be able to
absorb the page in any case.

But again, this all is taking Linux concepts into consideration, which
don't belong in the architectural model (or at most as an example, but
your examples started _after_ you already started dealing with Linux
specifics).

Jan
Dario Faggioli
2013-Aug-16 22:53 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> And finally, coming back to what Tim had already pointed out - doing
> >> things the way you propose can cause an imbalance in the
> >> ballooned down guest, penalizing it in favor of not penalizing the
> >> intended consumer of the recovered memory. Therefore I wonder
> >> whether, without any new xenstore node, it wouldn't be better to
> >> simply require conforming balloon drivers to balloon out memory
> >> evenly across the domain's virtual nodes.
> >
> > I should say sorry here, but I don't quite understand the "whether" part.
> > The "new xenstore node" just stores the requirement from the user, so that
> > the balloon driver can read it. It's similar to ~/memory/target. This new
> > node could store either a p-nodeid or a v-nodeid, depending on whether the
> > interface we talked about above is placed inside Xen or inside the guest OS.
> > Do you have a better way to pass this requirement to the balloon driver,
> > instead of creating a new xenstore node? I'd be very happy if you have one,
> > since I don't like the way I have done it (creating a new node) either!
>
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.
>
Why not support both modes? I mean, Jan, I totally see what you mean,
and I agree: a very important use case is where the user just says
"balloon down/up", in which case reclaiming/populating evenly is the
most sane thing to do (as also said by David -- I think it was him
rather than Tim).

However, what about the use case where the user actually wants to make
space on a specific p-node, no matter what it will cost to the existing
guests?
I don't have that much "ballooning experience", so I am genuinely asking
here: is that use case completely irrelevant? I personally think it
would be something nice to have, although, again, probably not as the
default behaviour...

What about something like: the default is the even distribution, but if
the user makes it clear he wants a specific p-node (whatever v-node or
v-nodes that will mean for the guest), we also allow that?

For actually doing it, I'm not sure what the best interface would be...
The new xenstore key did not look perfect, but not that bad either.
FWIW, if we'd stick with it, I agree with you that it should host
v-nodes (so the hypercall doing the translation should happen in the
toolstack), and that it should host a mask.

> > You are exactly right again, this design is only for the Linux balloon
> > driver. For Linux, the balloon driver can choose which pages to balloon
> > in/out, so we can associate the pages with a v-nodeid.
> > For other kinds of architecture, please forgive me that I haven't
> > thought that far...
>
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
>
You are right. Although it is true that this series is specifically for
Linux, Linux specific concepts populate the design document too much, or
at least in the wrong places. :-)

That being said (and perhaps Yechen could make a note about this, so
that if he sends another version, he could reorganize this
patch/document a bit, to achieve a better separation between the generic
model description and the implementation details), if you consider all
this the description of the Linux specific implementation, it would be
interesting to know what you think of such specific implementation.
:-)

Thanks a lot for taking a look and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Aug-16 23:30 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
>
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
>
I may be wrong (I'm sorry, I had very few chances to look at the
ballooning code, and won't be able to do so for a while), but I think
what we want here is the other way around, i.e., having a way to make
sure that the memory that will come in will also end up --in the guest--
within a specific v-node.

I don't know if the only/best way to do this is the array of lists in
Yechen's patches, and I agree (as per the other e-mail) that this is
more an implementation detail than anything else, but I think the point
here is: do we want to support that operational mode (again, perhaps not
as the default mode, even in a virtual NUMA enabled guest)?

> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.
>
I agree on the first part of this comment... Too much Linux-ism in the
description of what should be a generic model.
The second part (the one about what happens if memory comes from an
"unknown" node) I'm not sure I get.

Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
and that is because, at creation time, either the user or the toolstack
decided this was the way to go.

So, if page 2 was ballooned down, when ballooning it up, we would like
to retain the fact that it is backed by a frame in p-node 2, and we
could ask Xen to try to make that happen. On failure (e.g., no free
frames on p-node 2), we could either fail or have Xen allocate the
memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
it (i.e., map G's page 2 there), which I think is what you mean with
<<node the domain so far was "knowing" of>>, isn't it?

Or was it something different that you were asking?

Thanks and Regards,
Dario
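The fallback behaviour Dario describes above could look roughly like this (a hypothetical sketch; `alloc_backing_node` and `free_pages` are invented names standing in for Xen's real per-node heap accounting):

```python
def alloc_backing_node(preferred, free_pages, node_exact=False):
    """Pick a p-node to back a re-populated page.

    preferred:  p-node the page was backed by before ballooning down
    free_pages: dict mapping p-node id -> free page count
    node_exact: if True, fail rather than allocate off-node
    """
    if free_pages.get(preferred, 0) > 0:
        return preferred
    if node_exact:
        return None  # caller reports failure to reach the target
    # Fall back to the p-node with the most free memory; the guest may
    # then see memory on a node it was not previously "knowing" of.
    fallback = max(free_pages, key=free_pages.get)
    return fallback if free_pages[fallback] > 0 else None
```

In the example above, re-populating G's page 2 when p-node 2 is exhausted would return some other node (or `None` under a strict "node exact" policy).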
Jan Beulich
2013-Aug-19 09:17 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
> On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
>> I believe this thinking of yours stems from the fact that in Linux the
>> page control structures are associated with nodes by way of the
>> physical memory map being split into larger pieces, each coming from
>> a particular node. But other OSes don't need to follow this model,
>> and what you propose would also exclude extending the spanned
>> nodes set if memory gets ballooned in that's not associated with
>> any node the domain so far was "knowing" of.
>>
> I agree on the first part of this comment... Too much Linux-ism in the
> description of what should be a generic model.
>
> The second part (the one about what happens if memory comes from an
> "unknown" node) I'm not sure I get.
>
> Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
> pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
> v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
> and that is because, at creation time, either the user or the toolstack
> decided this was the way to go.
> So, if page 2 was ballooned down, when ballooning it up, we would like
> to retain the fact that it is backed by a frame in p-node 2, and we
> could ask Xen to try to make that happen. On failure (e.g., no free
> frames on p-node 2), we could either fail or have Xen allocate the
> memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
> it (i.e., map G's page 2 there), which I think is what you mean with
> <<node the domain so far was "knowing" of>>, isn't it?

Right. Or the guest could choose to create a new node on the fly.

Jan
Jan Beulich
2013-Aug-19 09:22 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 12:53 AM >>>
> On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
>> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
>> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> And finally, coming back to what Tim had already pointed out - doing
>> >> things the way you propose can cause an imbalance in the
>> >> ballooned down guest, penalizing it in favor of not penalizing the
>> >> intended consumer of the recovered memory. Therefore I wonder
>> >> whether, without any new xenstore node, it wouldn't be better to
>> >> simply require conforming balloon drivers to balloon out memory
>> >> evenly across the domain's virtual nodes.
>> >
>> > I should say sorry here, but I don't quite understand the "whether" part.
>> > The "new xenstore node" just stores the requirement from the user, so that
>> > the balloon driver can read it. It's similar to ~/memory/target. This new
>> > node could store either a p-nodeid or a v-nodeid, depending on whether the
>> > interface we talked about above is placed inside Xen or inside the guest OS.
>> > Do you have a better way to pass this requirement to the balloon driver,
>> > instead of creating a new xenstore node? I'd be very happy if you have one,
>> > since I don't like the way I have done it (creating a new node) either!
>>
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
>>
> Why not support both modes? I mean, Jan, I totally see what you mean,
> and I agree: a very important use case is where the user just says
> "balloon down/up", in which case reclaiming/populating evenly is the
> most sane thing to do (as also said by David -- I think it was him
> rather than Tim).
>
> However, what about the use case where the user actually wants to make
> space on a specific p-node, no matter what it will cost to the existing
> guests? I don't have that much "ballooning experience", so I am
> genuinely asking here: is that use case completely irrelevant? I
> personally think it would be something nice to have, although, again,
> probably not as the default behaviour...
>
> What about something like: the default is the even distribution, but if
> the user makes it clear he wants a specific p-node (whatever v-node or
> v-nodes that will mean for the guest), we also allow that?

Yes, supporting both modes is certainly desirable; I merely tried to
point out that the description went too far in the direction of
advocating only the enforce-a-node model (almost like this was the only
sensible one from the NUMA perspective). Penalizing another guest should
clearly not be a default action, but it may validly be a choice by the
administrator.

Jan
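A sketch of how the two modes could coexist, with even spreading as the default and node-targeted ballooning as an explicit administrator choice (illustrative only; `plan_balloon` and its parameters are invented names, not part of the posted patches):

```python
def plan_balloon(delta_pages, vnodes, target_vnodes=None):
    """Return a {vnode: pages} plan for ballooning out delta_pages.

    target_vnodes: optional list of vnodes the administrator singled out
    (e.g. those mapping to a p-node that must be vacated); when absent,
    spread evenly so no single vnode is penalized.
    """
    victims = target_vnodes if target_vnodes else vnodes
    base, rem = divmod(delta_pages, len(victims))
    # Distribute the remainder one page at a time over the first vnodes.
    return {v: base + (1 if i < rem else 0) for i, v in enumerate(victims)}
```

The default path penalizes nobody; passing `target_vnodes` concentrates the pressure, which is exactly the administrator-only choice discussed above.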
George Dunlap
2013-Aug-19 11:05 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> > +So, in NUMA aware ballooning, ballooning down and up works as follows:
> > +
> > +* target < current usage -- first of all, the ballooning driver uses the
> > +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
> > +  as explained above) to translate _pnid_ (that it reads from xenstore) to
> > +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>
> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.

I agree with this -- I would like to point out that to make this work
for ballooning *up*, however, there will need to be a way for the guest
to specify, "please allocate from vnode X", and have Xen translate the
vnode into the appropriate pnode(s).

 -George
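The translation George describes amounts to a lookup in the vnode-to-pnode mapping fixed at domain-build time. A toy sketch (the `v2p` table and helper name are invented; in reality only Xen and the toolstack hold this mapping, never the guest):

```python
# Hypothetical vnode -> set-of-pnodes mapping, decided at domain build
# time (matching the running example: v-node 0 on p-node 2, v-node 1 on
# p-node 4).
v2p = {0: {2}, 1: {4}}

def vnode_to_pnode_mask(vnid):
    """Resolve a guest vnode id to the physical node(s) backing it."""
    return v2p.get(vnid, set())

# A balloon-up request for vnode 0 becomes "allocate on p-node 2":
assert vnode_to_pnode_mask(0) == {2}
```

The guest only ever names vnodes; the resolution to pnodes happens on the Xen/toolstack side, so the physical topology stays hidden.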
David Vrabel
2013-Aug-19 12:58 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On 16/08/13 05:13, Yechen Li wrote:
>
> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> +
> +This service is provided by the hypervisor (and wired, if necessary, all the
> +way up to the proper toolstack layer or guest kernel), since it is only Xen
> +that knows both the virtual and the physical topologies.

The physical NUMA topology must not be exposed to guests that have a
virtual NUMA topology -- only the toolstack and Xen should know the
mapping between the two.

A guest cannot make sensible use of a machine topology as it may be
migrated to a host with a different topology.

> +## Description of the problem ##
> +
> +Let us use an example. Let's assume that guest _G_ has 2 virtual vnodes,
> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
> +respectively.
> +
> +Now, the user wants to create a new guest, but the system is under high memory
> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
> +the best chances to accommodate all the memory for the new guest, which would
> +be really good for performance, if only he can make space there. _G_ is the
> +only domain eating some memory from pnode #2 but, as said above, not all of
> +its memory comes from there.

It is not clear to me that this is the optimal decision. What
tools/information will be available that the user can use to make
sensible decisions here? e.g., is the current layout available to tools?

Remember that the "user" in this example is most often some automated
process and not a human.

> +So, right now, the user has no way to specify that he wants to balloon down
> +_G_ in such a way that he will get as many free pages as possible from pnode
> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
> +no guarantee on from what pnode the memory will be freed.
> +
> +The same applies to the ballooning up case, when the user, for some specific
> +reasons, wants to be sure that it is the memory of some (other) specific pnode
> +that will be used.

I would like to see some real world examples of cases where this is
sensible.

In general, I'm not keen on adding ABIs or interfaces that don't solve
real world problems, particularly if they're easy to misuse and end up
with something that is very suboptimal.

> +## NUMA-aware ballooning ##
> +
> +The new NUMA-aware ballooning logic works as follows.
> +
> +There is room, in libxl\_set\_memory\_target(), for two more parameters, in
> +addition to the new memory target:

The Xenstore interface should be the primary interface being documented.
The libxl interface is secondary and (probably) a consequence of the
xenstore interface.

> +* _pnid_ -- which is the pnode id of the node the user wants to try to get
> +  some free memory on
> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
> +  possible to reach the new ballooning target only with memory from pnode
> +  _pnid_, the user is fine with using memory from other pnodes.
> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
> +  it is false, the new target will (probably) be reached, but it is possible
> +  that some memory is freed on pnodes other than _pnid_.
> +
> +To let the ballooning driver know about these new parameters, a new xenstore
> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
> +operation to occur, the user should write the proper values in both keys:
> +~/memory/target\_nid and ~/memory/target.

If we decide we do need such control, I think the xenstore interface
should look more like:

    memory/target

        as before

    memory/target-by-nid/0

        target for virtual node 0

    ...

    memory/target-by-nid/N

        target for virtual node N

I think this better reflects the goal, which is an adjusted NUMA layout
for the guest, rather than the steps required to reach it (release P
pages from node N).

The balloon driver attempts to reach target, whilst simultaneously
trying to reach the individual node targets. It should prefer to balloon
up/down on the node that is furthest from its node target.

In cases where target and the sum of target-by-nid/N don't agree (or are
not present) the balloon driver should use target, and balloon up/down
evenly across all NUMA nodes.

The libxl interface does not necessarily have to match the xenstore
interface if that's what the initial tools would prefer.

Finally, a style comment: please avoid the use of single gender-specific
pronouns in documentation/comments (i.e., don't always use he/his etc.).
I prefer to use a singular "they", but you could consider "he or she" or
using "he" for some examples and "she" in others.

David
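The selection rule in David's proposal (prefer the node furthest from its per-node target) could be sketched like this (illustrative only; `current` and `targets` stand for per-vnode page counts, the latter as would be read from the proposed memory/target-by-nid/N keys):

```python
def next_balloon_node(current, targets):
    """Pick the vnode whose population is furthest above its target.

    current, targets: dicts mapping vnode id -> page counts.
    Returns None once every node is at or below its target.
    """
    over = {v: current[v] - targets[v]
            for v in current if current[v] > targets[v]}
    # Balloon out on the most-over-target node first.
    return max(over, key=over.get) if over else None
```

A real driver would loop: pick a node, release a batch of pages from it, re-read the keys, and repeat until all per-node targets (and the global target) are met.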
George Dunlap
2013-Aug-19 13:26 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.
>
>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has 2 virtual vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode #2 but, as said above, not all of
>> +its memory comes from there.
>
> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.
>
>> +So, right now, the user has no way to specify that he wants to balloon down
>> +_G_ in such a way that he will get as many free pages as possible from pnode
>> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
>> +no guarantee on from what pnode the memory will be freed.
>> +
>> +The same applies to the ballooning up case, when the user, for some specific
>> +reasons, wants to be sure that it is the memory of some (other) specific pnode
>> +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

I think at the very least the guest needs to be able to say, "allocate
me a page from vnode X", and have Xen translate that into a pnode
internally, so that ballooning down and back up again doesn't destroy a
guest's NUMA memory affinity (e.g., the vnode->pnode memory mapping).

[snip]

>> +* _pnid_ -- which is the pnode id of the node the user wants to try to get
>> +  some free memory on
>> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
>> +  possible to reach the new ballooning target only with memory from pnode
>> +  _pnid_, the user is fine with using memory from other pnodes.
>> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
>> +  it is false, the new target will (probably) be reached, but it is possible
>> +  that some memory is freed on pnodes other than _pnid_.
>> +
>> +To let the ballooning driver know about these new parameters, a new xenstore
>> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
>> +operation to occur, the user should write the proper values in both keys:
>> +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
>     memory/target
>
>         as before
>
>     memory/target-by-nid/0
>
>         target for virtual node 0
>
>     ...
>
>     memory/target-by-nid/N
>
>         target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).

This seems more sensible than a mask (as Jan suggested); but is it open
to race conditions?

> The balloon driver attempts to reach target, whilst simultaneously
> trying to reach the individual node targets. It should prefer to
> balloon up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or
> are not present) the balloon driver should use target, and balloon
> up/down evenly across all NUMA nodes.
>
> The libxl interface does not necessarily have to match the xenstore
> interface if that's what the initial tools would prefer.
>
> Finally, a style comment: please avoid the use of single gender-specific
> pronouns in documentation/comments (i.e., don't always use he/his etc.).
> I prefer to use a singular "they", but you could consider "he or she" or
> using "he" for some examples and "she" in others.

Doing half and half seems a bit strange to me; if we're trying for
gender equity, I'd just go for "she" all the way. There are enough "he"s
in the wider literature to more than balance it out for many years to
come. :-)

 -George
Dario Faggioli
2013-Aug-20 14:05 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
[For this message and for all the others I'm sending today, sorry for
the webmail :-P]

From: Jan Beulich [jbeulich@suse.com]
> >>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
> > Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
> > pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
> > v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
> > and that is because, at creation time, either the user or the toolstack
> > decided this was the way to go.
> > So, if page 2 was ballooned down, when ballooning it up, we would like
> > to retain the fact that it is backed by a frame in p-node 2, and we
> > could ask Xen to try to make that happen. On failure (e.g., no free
> > frames on p-node 2), we could either fail or have Xen allocate the
> > memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
> > it (i.e., map G's page 2 there), which I think is what you mean with
> > <<node the domain so far was "knowing" of>>, isn't it?
>
> Right. Or the guest could choose to create a new node on the fly.
>
> Jan
>
Are you talking of a guest creating a new virtual node, i.e., changing
its own (virtual) NUMA topology on the fly? If yes, that could be an
option too, I guess.

It is not something we plan to support in the first implementation of
virtual NUMA for Linux, but since we are talking about making this spec
document more general, yes, I think we should not rule out such a
possibility.

Thanks and Regards,
Dario
Dario Faggioli
2013-Aug-20 14:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: Jan Beulich [jbeulich@suse.com]
>
> Yes, supporting both modes is certainly desirable; I merely tried to
> point out that the description went too far in the direction of
> advocating only the enforce-a-node model (almost like this was the
> only sensible one from the NUMA perspective). Penalizing another guest
> should clearly not be a default action, but it may validly be a choice
> by the administrator.
>
> Jan
>
Ok, cool to know; the point you're trying to make is a really valuable
one, and I agree with it. :-)

So, Yechen, in case you're up for another version of this series, here's
what I would recommend you to think about:

* take this design document and reorganize it a bit, in order to
  separate the generic description of the NUMA-aware ballooning concept,
  functioning, interface, etc., from the details of the Linux
  implementation. I think those details could still live here
  (especially at this stage), but they should be in their own separate
  section(s), as an "example implementation"
* given the issues/doubts about the new interface, could we have a first
  version of the code _without_ the new xenstore key that just does
  something as follows:
  - the user asks to balloon up/down by N pages
  - if the guest is a virtual NUMA enabled guest with Y virtual nodes,
    the ballooning driver gives/takes N/Y pages to/from each node

Of course, when looking at/implementing the latter point, do not throw
this code here away (I'm talking about the version you submitted, with
the new xenstore key)... Or at least, I think it is a valuable addition,
so if I were you, I wouldn't throw it away.

I think it would be entirely possible to reach consensus on the "evenly
spreading without new xenstore key" version, as a first step, and
upstream it. Afterwards, we will enhance the interface (if we decide we
like it) with the per-node targets.

What do you think? I'm of course up for helping you with that, but not
before a couple of weeks.
:-P

Thanks and Regards,
Dario
David Vrabel
2013-Aug-20 14:20 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On 19/08/13 14:26, George Dunlap wrote:
> On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
>> If we decide we do need such control, I think the xenstore interface
>> should look more like:
>>
>>     memory/target
>>
>>         as before
>>
>>     memory/target-by-nid/0
>>
>>         target for virtual node 0
>>
>>     ...
>>
>>     memory/target-by-nid/N
>>
>>         target for virtual node N
>>
>> I think this better reflects the goal, which is an adjusted NUMA layout
>> for the guest, rather than the steps required to reach it (release P
>> pages from node N).
>
> This seems more sensible than a mask (as Jan suggested); but is it
> open to race conditions?

xenstore transactions could be used to make the reads/writes of the set
of values atomic.

David
Jan Beulich
2013-Aug-20 14:24 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 20.08.13 at 16:05, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Are you talking of a guest creating a new virtual node, i.e., changing
> its own (virtual) NUMA topology on the fly? If yes, that could be an
> option too, I guess.

Yes, albeit not necessarily in a way visible to the hypervisor (i.e. the
guest may do this just for its own purposes).

Jan
Dario Faggioli
2013-Aug-20 14:31 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: dunlapg@gmail.com [dunlapg@gmail.com] on behalf of George Dunlap
[George.Dunlap@eu.citrix.com]
> On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
>> > +So, in NUMA aware ballooning, ballooning down and up works as follows:
>> > +
>> > +* target < current usage -- first of all, the ballooning driver uses the
>> > +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
>> > +  as explained above) to translate _pnid_ (that it reads from xenstore) to
>> > +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>>
>> This looks conceptually wrong: The balloon driver should have no
>> need to know about pNID-s; it should be the tool stack doing the
>> translation prior to writing the xenstore node.
>
> I agree with this
>
Well, if we're talking about the principle of separating host and guest
(real and virtual) NUMA topology, I not only agree, I am one of its
strongest advocates (as George and Elena can testify! :-P)

What we're talking about here is some sort of translation service to
call every time some kind of hint about the relationship between virtual
and real is necessary, to perform a specific operation better. IOW, the
guest would not be allowed to store the result of this call and use it
in the future (and expect it to be accurate). Perhaps that was not
stated clearly enough in the description.

I'd be very happy to get rid of this too, but George himself points out
here below why this (or something really similar to this) is necessary.

> -- I would like to point out that to make this work
> for ballooning *up*, however, there will need to be a way for the
> guest to specify, "please allocate from vnode X", and have Xen
> translate the vnode into the appropriate pnode(s).
>
Exactly.
And since we already have all that is needed in place to tell Xen something
like "allocate from this p-node", Yechen went for this two step approach:
1) translate, 2) ask Xen to allocate. This did not require any change at the
hypervisor level, so we thought it could be nice. :-)

Of course, if it's considered better to modify Xen to perform the whole
operation, fine, this is for sure something that can be done, and that does
not have much effect on the rest of the design and of the implementation,
right Yechen?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
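[Editorial note: the two-step balloon-up path discussed above (translate the
virtual node, then use the existing "allocate from this p-node" interface)
can be sketched roughly as below. All names, the mapping, and the allocation
callback are hypothetical illustrations, not actual driver code.]

```python
# Example virtual-to-physical node mapping; in reality this would come
# from the (proposed) VNODE_TO_PNODE translation service, not a dict.
VNODE_TO_PNODE = {0: 0, 1: 2}

def balloon_up_on_vnode(vnid, nr_pages, allocate):
    """Step 1: translate vnid to a pnode; step 2: ask for pages there."""
    pnid = VNODE_TO_PNODE[vnid]      # translation service (hypothetical)
    return allocate(pnid, nr_pages)  # existing "allocate from pnode" call

# A stand-in for the hypervisor allocation call, for illustration.
def fake_allocate(pnid, nr_pages):
    return [(pnid, i) for i in range(nr_pages)]

pages = balloon_up_on_vnode(1, 3, fake_allocate)
# All three pages come from pnode 2, the node backing vnode 1.
```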
Dario Faggioli
2013-Aug-20 14:55 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: David Vrabel

> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

See the other e-mail (me replying to George about something like that being
necessary for ballooning up, although, yes, it probably can happen all in
Xen, if that's considered better).

>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has 2 vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode #2 but, as said above, not all of
>> +its memory comes from there.
>
> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?

Well, the whole "free pages from pnode #2" thing is more a tool than a
decision. It's a tool that will become available to better enact decisions
made at some upper level (i.e., admin, or toolstack).
The current layout of how much memory is occupied on what node by each guest
is definitely something we should have in place (even independently from
this feature/series, I think). It's already available via a Xen debug key,
so it's just a matter of wiring it up (I think). I'll give it a try as soon
as I'm back to work.

> Remember that the "user" in this example is most often some automated
> process and not a human.

Exactly. :-D

> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

I see what you mean, and certainly I don't disagree. It's a bit of a
chicken-&-egg problem, since I can't find real examples of something that
does not exist, but yes, I think we can investigate a bit more whether or
not something like this would be useful.

The reason I think it is, is that we have an automatic initial placement
algorithm that, every time we create a VM, tries to find the smallest set of
nodes to place it on, and I think it would be nice to give the admin (or
some advanced toolstack) all the tools to maximize the probability of such
an algorithm finding a suitable and performance-friendly solution... Right
now the only such tool is "kill or migrate some VM somewhere else", which is
not that much... :-P

> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).

Oh, cool, I really like this.
Yechen, what do you think?

> The balloon driver attempts to reach target, whilst simultaneously trying
> to reach the individual node targets. It should prefer to balloon
> up/down on the node that is furthest from its node target.

And this is an interesting idea too.

> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.

And that would be fine too... As I said in another e-mail, what I propose to
Yechen is to start dealing with this latter case, i.e., get rid of the new
controls (or pretend they're not there) and implement the even distribution
of ballooned pages across virtual NUMA nodes. After that, we will move to a
more advanced interface, if we deem it worthwhile.

> Finally a style comment, please avoid the use of single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.). I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.

Good point. Personally, I think I prefer the "they" form, but I'm fine with
both.

Thanks and Regards,
Dario
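[Editorial note: David's proposed driver behaviour can be sketched as below.
This is a toy illustration of the assumed semantics (prefer the node
furthest from its per-node target; fall back to an even spread when the
per-node targets disagree with the global one), not real balloon-driver
code, and the function name is made up.]

```python
def pick_node(current, target_by_nid, target):
    """Return the vnode to balloon next, or None for the even fallback.

    current       -- pages currently held per vnode
    target_by_nid -- per-node targets (memory/target-by-nid/N)
    target        -- global target (memory/target)
    """
    if sum(target_by_nid) != target:
        # Targets disagree: ignore per-node targets, spread evenly.
        return None
    # Distance from target; positive means balloon down on that node.
    deltas = [cur - tgt for cur, tgt in zip(current, target_by_nid)]
    nid = max(range(len(deltas)), key=lambda n: abs(deltas[n]))
    return nid if deltas[nid] != 0 else None

# vnode 1 is 100 pages over its target, vnode 0 only 20: pick vnode 1.
```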
Li Yechen
2013-Aug-20 15:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi David,

On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:

> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Most of you share the opinion that the interface should be in Xen, not in
the guest balloon driver. I agree with it. In the next version I'll think
about how to implement this interface between Xen and the balloon driver.

> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.

We'd like to have a libxl tool for the user, or for an automated process, to
change the node affinity of a guest. The decision is made by whoever calls
this libxl tool.

>> +So, right now, the user has no way to specify that he wants to balloon down
>> +_G_ in such a way that he will get as much as possible free pages from pnode
>> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
>> +no guarantee on from what pnode the memory will be freed.
>> +
>> +The same applies to the ballooning up case, when the user, for some specific
>> +reasons, wants to be sure that it is the memory of some (other) specific pnode
>> +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

Dario, could the test examples that you sent to me several months ago be
presented as a real-world example?
The example shows that, after several guests are created and shut down, the
node affinity is a mess.

> The Xenstore interface should be the primary interface being documented.
> The libxl interface is secondary and (probably) a consequence of the
> xenstore interface.
>
>> +* _pnid_ -- which is the pnode id of which node the user wants to try get some
>> +  free memory on
>> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
>> +  possible to reach the new ballooning target only with memory from pnode
>> +  _pnid_, the user is fine with using memory from other pnodes.
>> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
>> +  it is false, the new target will (probably) be reached, but it is possible
>> +  that some memory is freed on pnodes other than _pnid_.
>> +
>> +To let the ballooning driver know about these new parameters, a new xenstore
>> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
>> +operation to occur, the user should write the proper values in both the keys:
>> +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).
>
> The balloon driver attempts to reach target, whilst simultaneously trying
> to reach the individual node targets. It should prefer to balloon
> up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.
> The libxl interface does not necessarily have to match the xenstore
> interface, if that's what the initial tools would prefer.
>
> Finally a style comment, please avoid the use of single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.). I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.
>
> David

Oh, I think this is a better interface! I'd appreciate this much more than
what I have now. However, the code I show here _does_not_really_work_. It
just passes some small tests, and I'm afraid that there may be some bugs in
my code. Could I set up David's interface as a secondary goal, waiting until
this code is fully tested? I'm really not that confident :)

--

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Aug-23 20:53 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.

I think exposing any NUMA topology to a guest - regardless of whether it is
based on real NUMA or not - is OK, and actually a pretty neat thing.

Meaning you could tell a PV guest that it is running on a 16 socket NUMA box
while in reality it is running on a single socket box. Or vice-versa. It can
serve as a way to increase performance (or decrease it) - and also do
resource capping (this PV guest will only get 1GB of real fast memory and
then 7GB of slow memory) and let the OS handle the details of it (which it
does nowadays).

The mapping though - of which PV pages should belong to which fake PV NUMA
node, and how they bind to the real NUMA topology - that part I am not sure
how to solve. More on this later.

> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Correct. And that is OK - it just means that the performance can suck
horribly while it is there. Or the guest can be migrated to an even better
NUMA machine where it will perform even better. That is nothing new, and it
is no different whether you have PV NUMA in a guest or not.

>> +## Description of the problem ##

I think you have to back this up with the problem description. That is, you
need to think of:
 - How a PV guest will allocate pages at bootup based on this
 - How it will balloon up/down within those "buckets".
If you are using the guest's NUMA hints, they usually come in the form of
'allocate pages on this node', and the node information is of the type 'pfn
X to pfn Y are on this NUMA node'. That does not work very well with
ballooning, as the ballooned pages can be scattered across various nodes.
But that is mostly b/c the balloon driver is not even trying to use the NUMA
APIs. It could use them, and then it would do the best it can and perhaps
balloon round-robin across the NUMA pools. Or perhaps a better option would
be to use the memory hotplug mechanism (which is implemented in the balloon
driver) and do large swaths of memory.

But more problematic is migration. If you migrate a guest to a host that has
a different NUMA topology, what you really really want is:
 - unplug all of the memory in the guest
 - replug the memory with the new NUMA topology

Obviously this means you need some dynamic NUMA system - and I don't know of
such. The unplug/plug can be done via the balloon driver and/or the memory
hotplug system. But then - the boundaries of the NUMA pools are set at boot
time, and you would want to change them. Are SRAT/SLIT dynamic? Could they
change during runtime?

Then there is the concept of AutoNUMA, where you would migrate pages from
one node to another. With a PV guest that would imply that the hypervisor
would poke the guest and say: "OK, time to alter your P2M table". Which I
guess right now is done best via the balloon driver - so what you really
want is a callback to tell the balloon driver: hey, balloon down and up this
PFN block on NUMA node X.

Perhaps what could be done is to set up, in the cluster of hosts, the worst
case NUMA topology and force it on all the guests. Then, when migrating, the
"pools" can be filled/unfilled depending on which host the guest is on - and
whether it can fill up the NUMA pools properly. For example, it migrates
from a 1 node box to a 16 node box and all the memory is remote.
It will empty out the PV NUMA pool of the "closest" memory to zero and fill
up the PV NUMA pool of the "farthest" with all the memory, to balance it out
and have some real sense of the PV to machine host memory mapping.
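[Editorial note: the round-robin idea mentioned above - spreading a balloon
request across the guest's NUMA pools instead of draining whichever node the
allocator hits first - can be sketched as below. This is a toy illustration
only; the function name is made up and it has no relation to actual balloon
driver internals.]

```python
from itertools import cycle

def round_robin_balloon(nr_pages, nodes):
    """Distribute nr_pages release requests cyclically over nodes."""
    taken = {n: 0 for n in nodes}
    for _, n in zip(range(nr_pages), cycle(nodes)):
        taken[n] += 1
    return taken

# Releasing 10 pages over 3 nodes gives a 4/3/3 split, instead of
# taking all 10 from a single node.
```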
Dario Faggioli
2013-Aug-25 21:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-23 at 13:53 -0700, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
>> On 16/08/13 05:13, Yechen Li wrote:
>>>
>>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>>> +
>>> +This service is provided by the hypervisor (and wired, if necessary, all the
>>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>>> +that knows both the virtual and the physical topologies.
>>
>> The physical NUMA topology must not be exposed to guests that have a
>> virtual NUMA topology -- only the toolstack and Xen should know the
>> mapping between the two.
>
> I think exposing any NUMA topology to guest - regardless of whether it is
> based on real NUMA or not, is OK - and actually a pretty neat thing.

Yes, that is exactly how Elena, who is doing such work for PV guests, is
doing it.

> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).

Yes, exactly... Again. :-)

> The mapping though - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.

That is fine too. Again, Elena is working on both how to build up a virtual
topology and how to somehow map it to the real topology, for the sake of
performance. However, this series is about NUMA-aware ballooning, which is
something that makes sense _ONLY_ after we'll have all that virtual NUMA
thing in place.
That being said, I told Yechen that submitting what he already had as an RFC
could have been helpful anyway, i.e., he could get some comments on the
design, the approach, the interface, etc., which is actually what has
happened. :-) He should have been more clear about the fact that some
preliminary work was missing during the first submission. During the second
submission, I tried to help him make that more clear... If it still did not
work, and generated confusion instead, I am sorry about that.

About the technical part of this comment (guest knowledge of the real NUMA
topology), as I said already, I'm fine with leaving the guest completely in
the dark, if it's possible to provide a suitable interface between Xen and
the guest that will allow ballooning up to work (as George pointed out in
his e-mails).

>>> +## Description of the problem ##
>
> I think you have to back this up with the problem description. That is you
> need to think of:
>  - How a PV guest will allocate pages at bootup based on this

That's not this series' job...

>  - How it will balloon up/down within those "buckets".

That, I'm not sure I got (more below)...

> If you are using the guest's NUMA hints, they usually come in the form of
> 'allocate pages on this node', and the node information is of the type
> 'pfn X to pfn Y are on this NUMA node'. That does not work very well with
> ballooning as it can be scattered across various nodes. But that
> is mostly b/c the balloon driver is not even trying to use the NUMA
> APIs.

Indeed. The whole point is this.
"If it has been somehow established, at boot time, that pfn X is from virtual NUMA node 2, and that all the pfn-s from virtual node 2 are allocated --on the host-- on hardware NUMA node 0, let''s, when ballooning pfn X down and then ballooning it back up, make sure that: 1) in the guest it still belongs to virtual node 2, and 2) on the host is still backed by a page on hardware node 0" Does that make sense?> It could use it and then it would do the best it can and > perhaps balloon round-robin across the NUMA pools. >Exactly, that is what David suggested and what I also think it would be a nice first step (without any need of adding xenstore keys).> Or > perhaps a better option would be to use the hotplug memory mechanism > (which is implemented in the balloon driver) and do large swaths of > memory. >Mmm... I think it should all be possible without bothering with memory hotplug, but I may be wrong (I don''t really know much about memory hotplug).> But more problematic is the migration. If you migrate a guest > to node that has different NUMA topologies what you really really > want is: > - unplug all of the memory in the guest > - replug the memory with the new NUMA topology > > Obviously this means you need some dynamic NUMA system - and I don''t > know of such. >We don''t plan to support dynamically varying virtual NUMA topologies in the short term future. :-)> The unplug/plug can be done via the balloon driver > and or hotplug memory system. But then - the boundaries of the NUMA > pools is set a bootup time. And you would want to change them. > Is SRAT/SLIT dynamic? Could it change during runtime? >I don''t know if the real hw tables can actually change, but again, support for varying the virtual topology is not a priority right now.> Then there is the concept of AutoNUMA were you would migrate > pages from one node to another. With a PV guest that would imply > that the hypervisor would poke the guest and say: "ok, time > alter your P2M table". 
Yes, and that's what I have been working on for a while. It's particularly
tricky for a PV guest and, although very similar in principle, it's going to
be different from AutoNUMA in Linux since, for us, a migration is way more
expensive than it is for them.

> Which I guess right now is done best
> via the balloon driver - so what you really want is a callback
> to tell the balloon driver: hey, balloon down and up this
> PFN block on NUMA node X.

I'm currently doing it via some sort of "lightweight suspend/resume cycle".
I like the idea of trying to exploit the ballooning driver for that, but
that will probably happen in a subsequent step (I want it working that way
before starting to think about how to improve it! :-P).

Anyway, the above is just to say that this is also not this series' job and,
although there surely are contact points, I think these things can be
considered (and hence worked on/implemented) pretty independently.

> Perhaps what could be done is to set up in the cluster of hosts
> the worst case NUMA topology and force it on all the guests.
> Then when migrating the "pools" can be filled/unfilled
> depending on which host the guest is - and whether it can
> fill up the NUMA pools properly. For example it migrates
> from a 1 node box to a 16 node box and all the memory
> is remote. It will empty out the PV NUMA pool of the "closest"
> memory to zero and fill up the PV NUMA pool of the "farthest"
> with all memory to balance it out and have some real
> sense of the PV to machine host memory.

That is also nice... Perhaps this is meat for some sort of high (for sure
higher than xl/libxl) level management/orchestration layer, isn't it?

Anyway... I hope I helped clarify things a bit.
Thanks for having a look and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
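[Editorial note: the balloon-down/up invariant Dario spells out above can be
captured in a small sketch. The data model (a pfn-to-vnode map and a
vnode-to-pnode map) and all names are purely hypothetical, for illustration
only.]

```python
def balloon_cycle(pfn, vnode_of, vnode_to_pnode, allocate_on):
    """Balloon pfn down, then back up, preserving its NUMA placement.

    The two halves of the invariant: the page keeps its virtual node,
    and it is re-populated from the hardware node backing that vnode.
    """
    vnid = vnode_of[pfn]             # 1) still belongs to this vnode
    pnid = vnode_to_pnode[vnid]      # 2) must be backed by this pnode
    mfn = allocate_on(pnid)          # re-populate from the same pnode
    return vnid, pnid, mfn

# pfn 0x1000 is on virtual node 2, which is backed by hardware node 0.
vnid, pnid, _ = balloon_cycle(0x1000, {0x1000: 2}, {2: 0},
                              lambda p: ("mfn-on-node", p))
# vnid == 2 and pnid == 0: both halves of the invariant hold.
```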
Dario Faggioli
2013-Aug-25 21:24 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On mar, 2013-08-20 at 23:15 +0800, Li Yechen wrote:
> Hi David,
> On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
>> The physical NUMA topology must not be exposed to guests that have a [...]
>
> Most of you share the opinion that the interface should be in Xen, not in
> the guest balloon driver. I agree with it. In the next version I'll think
> about how to implement this interface between Xen and the balloon driver.

Perfect. That is not that much different from what you already have; the
only bit that will need some rework is the ballooning up path (see George's
e-mails).

>> In general, I'm not keen on adding ABIs or interfaces that don't solve
>> real world problems, particularly if they're easy to misuse and end up
>> with something that is very suboptimal.
>
> Dario, could the test examples that you sent to me several months ago be
> presented as a real-world example?
> The example shows that, after several guests are created and shut down,
> the node affinity is a mess

They were not real-world examples. As I said before, this is a
chicken-&-egg problem: there are no real world examples until we implement
the feature! :-P

What I think you're talking about is an old (2010?) Xen Summit presentation
from someone working on the same problem before, but then not finishing it.
I don't have the link handy right now... I'll see if I can find it and post
it here.

> Oh, I think this is a better interface!
> I'd appreciate this much more than what I have now. However, the code I
> show here _does_not_really_work_. It just passes some small tests, and
> I'm afraid that there may be some bugs in my code.
> Could I set up David's interface as a secondary goal, waiting until this
> code is fully tested? I'm really not that confident :)

EhEh... All code has bugs. :-)

As I said, I also like this interface more.
However, what I think you should concentrate on (apart from, of course,
debugging) is producing a version of the series which does not use any new
xenstore keys/interface at all, and just balloons up and down taking pages
evenly from all the guest's virtual NUMA nodes. After that, we can come back
and implement more fine grained control, probably via this interface David
is proposing here.

What do you think?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Li Yechen
2013-Sep-26 14:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi all,

Sorry for being absent for a month.... And congratulations to Elena for your
patches! I'll read them when I'm free~~

In conclusion, there are two things for the NUMA support bubble:

First,

> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal which is an adjusted NUMA layout
> for the guest rather than the steps required to reach it (release P
> pages from node N).

Yes, I think this is a very good idea. However, if target conflicts with the
sum of target-by-nid/xxx, the balloon driver may be confused. My idea is
something like:

1. The user can see the target tree, for example:
   memory/target
   memory/target-by-nid/0
   memory/target-by-nid/1
   memory/target-by-nid/2
2. The user can use the xl tool to set one of them to a specified value.
   For example: the user sets memory/target-by-nid/0 from 100M to 200M.
   That means: increase both memory/target and memory/target-by-nid/0 by
   100M.
   Another example: the user sets memory/target from 800M to 900M. That
   means: increase memory/target by 100M, but the balloon driver can make
   the per-node decision on its own.

   Does that make sense?

3. In domU, the balloon driver is notified of which key changed, and then
   balloons in/out pages from the corresponding node(s).

Second:

> I think exposing any NUMA topology to guest - regardless of whether it is
> based on real NUMA or not, is OK - and actually a pretty neat thing.
>
> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).
> The mapping though - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.

Here is the question: should the interfaces
machine_node_id_to_virtual_node_id (and also the reverse) be implemented
inside the kernel, or in Xen as a hypercall?

Elena, I haven't had time to look at your great patches, so I have no idea
whether you have implemented them or not.... If you have, I'd say sorry that
we still need to talk about it : - )

I think implementing them as hypercalls in Xen is very nice, since domU
shouldn't know the hypervisor's NUMA architecture.... But that also means I
have to change three hypercalls about memory operations.....

On the other hand, implementing them in the kernel is pretty neat, but it
goes against the rule of isolation....

Anyway, I prefer the former. Could you guys give me some ideas?

Thank you again for your suggestions on this topic : )

--

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU
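[Editorial note: the update semantics proposed in point 2 above could look
roughly like the sketch below - a per-node write also bumps the global
target by the same delta, while a global write leaves the per-node split to
the balloon driver. This is a sketch of the assumed rules, not an actual
xl/xenstore implementation, and all names are made up.]

```python
def set_node_target(targets, nid, new_value):
    """Per-node write: adjust the global target by the same delta."""
    delta = new_value - targets["by_nid"][nid]
    targets["by_nid"][nid] = new_value
    targets["target"] += delta

def set_global_target(targets, new_value):
    """Global write: the per-node split is left to the balloon driver."""
    targets["target"] = new_value

targets = {"target": 800, "by_nid": {0: 100, 1: 300, 2: 400}}
set_node_target(targets, 0, 200)
# target goes 800 -> 900 and node 0 goes 100 -> 200, matching the
# "100M to 200M" example in the message above.
```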
Li Yechen
2013-Sep-26 14:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Oh sorry, the "bubble" should be "balloon" On Thu, Sep 26, 2013 at 10:15 PM, Li Yechen <lccycc123@gmail.com> wrote:> Hi all, > Sorry for my being absent for a month.... And congratulate to Elena for > your patches! I''ll read them if I''m free~~ > > In conclusion, there are two things for the NUMA support bubble: > > First, > >If we decide we do need such control, I think the xenstore interface > >should look more like: > > > >memory/target > > > > as before > > > >memory/target-by-nid/0 > > > > target for virtual node 0 > > > >... > > > >memory/target-by-nid/N > > > > target for virtual node N > > > >I think this better reflects the goal which is an adjusted NUMA layout > >for the guest rather than the steps required to reach it (release P > >pages from node N). > > Yes I think this is a very good idea. However, if target is conflict with > the sum of target-by-nid/xxx, bubble may be confused.. > My idea is something as: > 1 User can know the target tree, for example: > memory/target > memory/target-by-nid/0 > memory/target-by-nid/1 > memory/target-by-nid/2 > 2 User can use xl tool to set one of them to an specified value. > For example: user set memory/target-by-nid/0 from 100M to 200M > then that means: increase both memory/target and memory/target-by-nid/0 > by 100M > Another example: user set memory/target from 800M to 900M > then that means: increase memory/target by 100M, but balloon could make > decision by its own. > > Does that make sencse? > > 3 In domU, balloon receive that which directory is changed. then it > balloon in/out pages from the node(s). > > > > Second: > > >I think exposing any NUMA topology to guest - irregardless whether it is > based > > on real NUMA or not, is OK - and actually a pretty neat thing. > > > >Meaning you could tell a PV guest that it is running on a 16 socket NUMA > >box while in reality it is running on a single socket box. Or vice-versa. 
> >It can serve as a way to increase performance (or decrease) - and also > >do resource capping (This PV guest will only get 1G of real fast > >memory and then 7GB of slow memory) and let the OS handle the details > > of it (which it does nowadays). > > > > The mapping thought - of which PV pages should belong to which fake > > PV NUMA node - and how they bind to the real NUMA topology - that part > > I am not sure how to solve. More on this later. > > Here is the question: should the interface: > machine_node_id_to_virtual_node_id (and also the reverse) be inplememted > inside kernel, or in Xen as a hypercall? > Elena, I haven''t have time to look at your great patches so that I have > no idea whether you had implememted it or no.... > If you had, I''d say sorry that we still need to talk about it : - ) > I think to implement them as a hypercall in xen is very nice, since > domU shouldn''t know hypervisior''s NUMA architecture.... > But that also means: I have to change three hypercalls about > memory operation..... > > On the other hand, implement them in Kernel is pretty neat, but it goes > against the rule of isolation.... > > Anyway I prefer the previous one. Could you guys give me some ideas? > > Thank you again for your suggestions on this topic : ) > > -- > > Yechen Li > > Team of System Virtualization and Cloud Computing > School of Electronic Engineering and Computer Science, > Peking University, China > > Nothing is impossible because impossible itself says: " I''m possible " > lccycc From PKU > >-- Yechen Li Team of System Virtualization and Cloud Computing School of Electronic Engineering and Computer Science, Peking University, China Nothing is impossible because impossible itself says: " I''m possible " lccycc From PKU _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel