Yechen Li
2013-Aug-16 04:13 UTC
[RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
This patch contains the details about the design, the benefits and the intended usage of NUMA-awareness in the ballooning driver, coming in the form of a markdown document in docs. Given this is an RFC, it has bits explaining what is still missing, that will of course need to disappear from the final version, but that are useful for reviewing the whole series at this stage.

Signed-off-by: Yechen Li <lccycc123@gmail.com>
Signed-off-by: Dario Faggioli <dario.faggioli@citrix.com>
---
 docs/misc/numa-aware-ballooning.markdown | 376 +++++++++++++++++++++++++++++++
 1 file changed, 376 insertions(+)
 create mode 100644 docs/misc/numa-aware-ballooning.markdown

diff --git a/docs/misc/numa-aware-ballooning.markdown b/docs/misc/numa-aware-ballooning.markdown
new file mode 100644
index 0000000..3a899f1
--- /dev/null
+++ b/docs/misc/numa-aware-ballooning.markdown
@@ -0,0 +1,376 @@

# NUMA-Aware Ballooning Design and Implementation #

## Rationale ##

The purpose of this document is to describe how the ballooning driver for Linux is made NUMA-aware, why this is important, and its intended usage on a NUMA host.

NUMA stands for Non-Uniform Memory Access, which means that a typical NUMA host has more than one memory controller, and that accessing a specific memory location can have a varying cost, depending on which physical CPU the access comes from. In the presence of such an architecture, it is very important to keep all the memory of a domain on the same physical NUMA node (or on the smallest possible set of physical NUMA nodes), in order to get good and predictable performance.

Ballooning is an effective means of making space when there is the need to create a new domain. However, on a NUMA host, one could want not only to free some memory, but also to free it from a specific node. That could be, for instance, the case for making sure the new domain will fit in just one node.
For more information on NUMA and on how Xen deals with it, see [Xen NUMA Introduction][numa_intro] on the Xen Wiki, and/or [xl-numa-placement.markdown][numa_placement].

## Definitions ##

Some definitions that could come in handy for reading the rest of this document:

* physical topology -- the hardware NUMA characteristics of the host. It typically comprises the number of NUMA nodes, the memory each node has, and how many (and which) physical CPUs belong to each node.
* virtual topology -- similar information to the physical topology, but applying to a guest rather than to the host (hence the "virtual"). It is constructed at guest creation time and comprises the number of virtual NUMA nodes, the amount of guest memory each virtual node is entitled to, and how many (and which) vCPUs belong to each virtual node.

Physical node may be abbreviated as pnode, or p-node, and virtual node is often abbreviated as vnode, or v-node (similarly to what happens with pCPU and vCPU).

### Physical topology and virtual topology ###

It is __not__ __at__ __all__ __required__ that there is any relationship between the physical and virtual topology, exactly as it is not at all required that there is any relationship between physical CPUs and virtual CPUs.

That is to say, on a host with 4 NUMA nodes, each node with 8 GiB RAM and 4 pCPUs, there can be guests with 1 vnode, 2 vnodes or 4 vnodes (it is not clear yet whether it would make sense to allow overbooking; perhaps yes, but only after having warned the user about it). In all cases, vnodes can have arbitrary sizes, with respect to both the amount of memory and the number of vCPUs in each of them.
Similarly, there is neither the requirement nor the guarantee that either:

* the memory of a vnode is actually allocated on a particular pnode, or even that it is allocated on __only__ one pnode;
* the vCPUs belonging to a vnode are actually bound or pinned to the pCPUs of a particular pnode.

Of course, although it is not required, having some kind of relationship in place between the physical and virtual topology would be beneficial for both performance and resource usage optimization. This is really similar to what happens with vCPU to pCPU pinning: it is not required, but if it is possible to do it properly, performance will increase.

In this specific case, the ideal situation would be as follows:

* all the guest memory pages that constitute one vnode, say vnode #1, are backed by frames that, on the host, belong to the same pnode, say pnode #3 (and that should be true for all the vnodes);
* all the guest vCPUs that belong to vnode #1 have, as far as the Xen scheduler is concerned, NUMA node affinity with (some of) the pCPUs belonging to pnode #3 (and that again should hold for all the vnodes).

For that reason, a mechanism for the user to specify the relationship between virtual and physical topologies at guest creation time will be implemented, as well as some automated logic for coming up with a sane default, in case the user does not say anything (as happens with NUMA automatic placement).

## Current status ##

At the time of writing this document, the work to allow a virtual NUMA topology to be specified and properly constructed within a domain is still in progress. Patches will be released soon, both for Xen and Linux, implementing exactly what has been described in the previous section.

NUMA-aware ballooning needs such a feature to be present, or it won't work properly, so the remainder of this document will assume that we have virtual NUMA in place.
We will also assume that we have the following "services" in place, provided by the Xen hypervisor via a suitable hypercall interface, and available to the toolstack and to the ballooning driver.

### nodemask VNODE\_TO\_PNODE(int vnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the way up to the proper toolstack layer or guest kernel), since it is only Xen that knows both the virtual and the physical topologies.

It takes the ID of a vnode, as they are identified within the virtual topology of a guest, and returns a bitmask. The bits set to 1 in it correspond to the pnodes from which, on the host, the memory of the passed vnode comes.

The ideal situation is that such a bitmask has only 1 bit set, since having the memory of a vnode coming from more than one pnode is clearly bad for access locality.

### nodemask PNODE\_TO\_VNODE(int pnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the way up to the proper toolstack layer or guest kernel), since it is only Xen that knows both the virtual and the physical topologies.

It takes the ID of a pnode, as they are identified within the physical topology of the host, and returns a bitmask. The bits set to 1 in it correspond to the vnodes, in the guest, that use memory from the passed pnode.

The ideal situation is that such a bitmask has only 1 bit set, but it is less of a problem (with respect to the case above) if there are more.

## Description of the problem ##

Let us use an example. Let's assume that guest _G_ has a virtual topology with 2 vnodes, and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2, respectively.

Now, the user wants to create a new guest, but the system is under high memory pressure, so he decides to try ballooning _G_ down.
He sees that pnode #2 has the best chances to accommodate all the memory for the new guest, which would be really good for performance, if only he can make space there. _G_ is the only domain eating some memory from pnode #2 but, as said above, not all of its memory comes from there.

So, right now, the user has no way to specify that he wants to balloon down _G_ in such a way that he will get as many free pages as possible from pnode #2, rather than from pnode #0. He can ask _G_ to balloon down, but there is no guarantee about the pnode from which the memory will be freed.

The same applies to the ballooning up case, when the user, for some specific reason, wants to be sure that it is the memory of some (other) specific pnode that will be used.

What is necessary is something allowing the user to specify from what pnode he wants the free pages, and allowing the ballooning driver to make that happen.

## The current situation ##

This section describes the details of how ballooning works right now. It is possible to skip it if you already possess such knowledge.

Ballooning happens inside the guest, with of course some "help" from Xen. The notification that ballooning up or down is necessary reaches the ballooning driver in the guest via xenstore. See [here][xenstore] and [here][xenstore_paths] for more details.

Typically, the user --via xl or via any other high level toolstack-- calls libxl\_set\_memory\_target(), where the new desired memory target for the guest is written in a xenstore key: ~/memory/target. The guest has a xenstore watch on this key, and every time it changes, work inside the guest itself is scheduled (the body is in the balloon\_process() function). It is the responsibility of this component of the ballooning driver to read the key and take proper action.
There are two possible situations:

* new target < current usage -- the ballooning driver maintains a list of free pages (similar to what the OS does for tracking free memory). Despite being actually free, they are hidden from the OS, which can't use them unless they are handed back to it by the ballooning driver. From the OS perspective, the ballooning driver can be seen as some sort of monster, eating some of the guest free memory.
  To steal a free page from the guest OS, and add it to the list of ballooned pages, the driver calls alloc\_page(). From the point of view of the OS, the ballooning monster ate that page, decreasing the amount of free memory (inside the guest).
  At this point, a XENMEM\_decrease\_reservation hypercall is issued, which hands the newly allocated page to Xen so that it can unmap it, effectively turning it into free host memory.
* new target > current usage -- the ballooning driver picks up one page from the list of ballooned pages it maintains and issues a XENMEM\_populate\_physmap hypercall, with the PFN of the chosen page as an argument.
  Xen goes looking for some free memory in the host and, when it finds some, allocates a page and maps it into the guest's memory space, using the PFN it has been provided with.
  At this point, the ballooning driver calls \_\_free\_reserved\_page() to let the guest OS know that the PFN can now be considered a free page (again), which also means removing the page from the ballooned pages list.

Operations like the ones described above go on within the ballooning driver, perhaps one after the other, until the new target is reached.

## NUMA-aware ballooning ##

The new NUMA-aware ballooning logic works as follows.
There is room, in libxl\_set\_memory\_target(), for two more parameters, in addition to the new memory target:

* _pnid_ -- the ID of the pnode on which the user wants to try to get some free memory;
* _nodeexact_ -- a bool specifying whether, in case it is not possible to reach the new ballooning target only with memory from pnode _pnid_, the user is fine with using memory from other pnodes.
  If _nodeexact_ is true, it is possible that the new target is not reached; if it is false, the new target will (probably) be reached, but it is possible that some memory is freed on pnodes other than _pnid_.

To let the ballooning driver know about these new parameters, a new xenstore key exists: ~/memory/target\_nid. So, for a proper NUMA-aware ballooning operation to occur, the user should write the proper values in both keys: ~/memory/target\_nid and ~/memory/target.

So, in NUMA-aware ballooning, ballooning down and up work as follows:

* target < current usage -- first of all, the ballooning driver uses the PNODE\_TO\_VNODE() service (provided by the virtual topology implementation, as explained above) to translate _pnid_ (which it reads from xenstore) to the corresponding set of vnode IDs, say _{vnids}_ (which will be a one-element set, in case there is only one vnode in the guest allocated out of pnode _pnid_). It then uses alloc\_pages\_node(), with one or more elements from _{vnids}_ as nodes, in order to rip the proper amount of ballooned pages out of the guest OS, and hands them to Xen (with the same hypercall as above).
  If not enough pages can be found, and only if _nodeexact_ is false, it starts considering other nodes (by just calling alloc\_page()). If _nodeexact_ is true, it just returns.
* target > current usage -- first of all, the ballooning driver uses the PNODE\_TO\_VNODE() service to translate _pnid_ to the corresponding set of vnode IDs, say _{vnids}_. It then looks for enough ballooned pages, among the ones belonging to the vnodes in _{vnids}_, and asks Xen to map them via XENMEM\_populate\_physmap. While doing the latter, it explicitly tells the hypervisor to allocate the actual host pages on pnode _pnid_.
  If not enough ballooned pages can be found among the vnodes in _{vnids}_, and only if _nodeexact_ is false, it falls back to looking for ballooned pages in other vnodes. For each one it finds, it calls VNODE\_TO\_PNODE(), to see on what pnode it belongs, and then asks Xen to map it, exactly as above.

The biggest difference between current and NUMA-aware ballooning is that the latter needs to keep multiple lists of the ballooned pages in an array, with one element for each virtual node. This way, it is always evident, at any given time, what ballooned pages belong to what vnode.

Regarding the "stealing a page from the OS" part, it is enough to use the Linux function alloc\_pages\_node() in place of alloc\_page().

Finally, from the hypercall point of view, both XENMEM\_decrease\_reservation and XENMEM\_populate\_physmap are already NUMA-aware, so it is just a matter of passing them the proper arguments.

### Usage examples ###

Let us assume we have a 4-vnode guest on an 8-pnode host. The virtual to physical topology mapping is as follows:

* vnode #0 --> pnode #1
* vnode #1 --> pnode #3
* vnode #2 --> pnode #3
* vnode #3 --> pnode #5

Let us also assume that the user has just decided that he wants to change the current memory allocation scheme, by ballooning (up or down) $DOMID.

Here are some usage examples of NUMA-aware ballooning, with a simplified explanation of what happens in the various situations.
#### Ballooning down, example 1 ####

The user wants N free pages from pnode #3, and only from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #3, and gets {1,2}
4. ballooning driver tries to steal N pages from vnode #1 and/or vnode #2 (both are fine) and adds them to the ballooned out pages (i.e., allocates them in the guest and unmaps them in Xen)
5. whether phase 4 succeeds or not, it returns, having potentially freed fewer than N pages on pnode #3

#### Ballooning down, example 2 ####

The user wants N free pages from pnode #1, but if that is impossible, is fine with other pnodes contributing to that:

1. user invokes `xl mem-set -n 1 $DOMID N`
2. libxl writes N to ~/memory/target and {1,false} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #1, and gets {0}
4. ballooning driver tries to steal N pages from vnode #0 and adds them to the ballooned out pages
5. if fewer than N pages are found in phase 4, the ballooning driver tries to steal from any vnode

#### Ballooning up, example 1 ####

The user wants to give N more pages, taking them from pnode #3, and only from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen what vnodes insist on pnode #3, and gets {1,2}
4. ballooning driver locates N ballooned pages belonging to vnode #1 or vnode #2 (both would do)
5. ballooning driver asks Xen to allocate N pages on pnode #3
6. ballooning driver asks Xen to map the new pages on the ballooned pages and hands the result back to the guest OS

NOTICE that, since _nodeexact_ was true, if phase 4 fails to find N pages (say it finds Y<N), then phases 5 and 6 will work with Y pages.

## Limitations ##

This feature represents a performance (and resource usage) optimization.
As is common in these cases, it does its best under certain conditions. In case it is put to work under different conditions, it is possible that its actual beneficial potential is diminished or, at worst, lost completely. Nevertheless, it is required that everything keeps working and that performance at least does not drop below the level it was at without the feature itself.

In this case, the best or worst working conditions have to do with how well the guest domain is placed on the host NUMA nodes, i.e., with how many and which physical NUMA nodes its memory comes from. If the guest has a virtual NUMA topology, they also have to do with the relationship between the virtual and the physical topology.

It is possible to identify three situations:

1. multiple vnodes are backed by host memory from only one pnode (but not vice-versa, i.e., there is no single vnode backed by host memory from two or more pnodes). We can call this scenario many-to-one;
2. each vnode is backed by host memory coming from one specific pnode (all different from each other). We can call this scenario one-to-one;
3. there are vnodes that are backed by host memory coming at the same time from multiple pnodes. We can call this scenario one-to-many.

If a guest does not have a virtual NUMA topology, it can be seen as having only one virtual NUMA node, accommodating all the memory.

Among the three scenarios above, 1 and 2 are fine, meaning that NUMA-aware ballooning will succeed in using host memory from the correct pnodes, guaranteeing improved performance and better resource utilization than the current situation (if there is enough free memory in the proper place, of course).

The third situation (one-to-many) is the only problematic one.
In fact, if we try to get several pages from pnode #1, with the guest vnode #0 having pages on both pnode #1 and pnode #2, the ballooning driver will pick up the pages from vnode #0's pool of ballooned pages, without any chance of knowing whether they actually are backed by host pages from pnode #1 or from pnode #2.

However, this just means that, in this case, NUMA-aware ballooning behaves pretty much like the current (i.e., __non__ NUMA-aware) ballooning, which is certainly undesirable, but still acceptable. On a related note, having the memory of a vnode split among two (or more) pnodes is a non-optimal situation already, at least from a NUMA perspective, so it would be unrealistic to ask the ballooning driver to do anything better than the above.

[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
[numa_placement]: http://xenbits.xen.org/docs/unstable/misc/xl-numa-placement.markdown
[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[xenstore_paths]: http://xenbits.xen.org/docs/unstable/misc/xenstore-paths.markdown

-- 
1.8.1.4
Jan Beulich
2013-Aug-16 09:09 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> +So, in NUMA aware ballooning, ballooning down and up works as follows:
> +
> +* target < current usage -- first of all, the ballooning driver uses the
> +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
> +  as explained above) to translate _pnid_ (that it reads from xenstore) to
> +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will

This looks conceptually wrong: the balloon driver should have no need to know about pNID-s; it should be the tool stack doing the translation prior to writing the xenstore node.

Further, the new xenstore node would presumably better be a mask than a single vNID, since in order to e.g. balloon up another guest already spanning multiple nodes, giving the tool stack a way to ask for memory on any of the spanned nodes.

And finally, coming back to what Tim had already pointed out - doing things the way you propose can cause an imbalance in the ballooned down guest, penalizing it in favor of not penalizing the intended consumer of the recovered memory. Therefore I wonder whether, without any new xenstore node, it wouldn't be better to simply require conforming balloon drivers to balloon out memory evenly across the domain's virtual nodes.

> +The biggest difference between current and NUMA-aware ballooning is that the
> +latter needs to keep multiple lists of the ballooned pages in an array, with
> +one element for each virtual node. This way, it is always evident, at any
> +given time, what ballooned pages belong to what vnode.

That's wrong afaict: ballooned out pages aren't associated with any memory, and hence can't be associated with any vNID. Once they get re-populated, which vNID the memory belongs to is an attribute of the memory coming in, not the control structure that it's to be associated with.
I believe this thinking of yours stems from the fact that in Linux the page control structures are associated with nodes by way of the physical memory map being split into larger pieces, each coming from a particular node. But other OSes don't need to follow this model, and what you propose would also exclude extending the spanned nodes set if memory gets ballooned in that's not associated with any node the domain so far was "knowing" of.

> +Regarding the stealing a page from the OS part, it is enough to use the Linux
> +function alloc_page_node(), in place of alloc\_page().

Such a statement seems to confirm that you're thinking Linux-centric instead of defining a generic model.

Jan
Li Yechen
2013-Aug-16 10:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi Jan,

On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.
>
> Further, the new xenstore node would presumably better be a mask
> than a single vNID, since in order to e.g. balloon up another guest
> already spanning multiple nodes, giving the tool stack a way to ask
> for memory on any of the spanned nodes.

Yeah, you are right. I'm also telling myself that it's not a good idea to let the guest OS know the physical IDs. These two translations between p-nid and v-nid could be put either inside Xen, or inside the balloon driver, which is its current state. Anyway, the interfaces for the guest NUMA topology have not been implemented yet. I'll mark this as a to-do issue and move it into Xen in the future.

> And finally, coming back what Tim had already pointed out - doing
> things the way you propose can cause an imbalance in the
> ballooned down guest, penalizing it in favor of not penalizing the
> intended consumer of the recovered memory. Therefore I wonder
> whether, without any new xenstore node, it wouldn't be better to
> simply require conforming balloon drivers to balloon out memory
> evenly across the domain's virtual nodes.

I should say sorry here, but I don't quite understand the "whether" part. The "new xenstore node" just stores the requirement from the user, so that the balloon driver can read it. It's similar to ~/memory/target. This new node could store either a p-nodeid or a v-nodeid, according to whether the interfaces we talked about above are placed inside Xen, or inside the guest OS. Do you have a better way to pass this requirement to the balloon driver, instead of creating a new xenstore node?
I'd be very happy if you have one, since I don't like the way I have done it (creating a new node) either!

> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
>
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
>
> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.

You are exactly right again; this design is only for the Linux balloon driver. For Linux, the balloon driver can choose which page to balloon in/out, so we can associate the pages with a v-nodeid. For other kinds of architecture, please forgive me that I haven't thought that far...

> > +Regarding the stealing a page from the OS part, it is enough to use the Linux
> > +function alloc_page_node(), in place of alloc\_page().
>
> Such statement seems to confirm that you're thinking Linux centric
> instead of defining a generic model.
>
> Jan

Yes. And thank you again for spending your valuable time reviewing my patch! I hope my answers could solve your questions. If not, please point it out for me!
-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science, Peking University, China

Nothing is impossible because impossible itself says: "I'm possible"
lccycc From PKU

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Jan Beulich
2013-Aug-16 13:21 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
(re-adding xen-devel to Cc)

>>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> And finally, coming back what Tim had already pointed out - doing
>> things the way you propose can cause an imbalance in the
>> ballooned down guest, penalizing it in favor of not penalizing the
>> intended consumer of the recovered memory. Therefore I wonder
>> whether, without any new xenstore node, it wouldn't be better to
>> simply require conforming balloon drivers to balloon out memory
>> evenly across the domain's virtual nodes.
>
> I should say sorry here, but I don't quite understand the "whether" part.
> The "new xenstore node" just stores the requirement from the user, so that
> the balloon driver can read it. It's similar to ~/memory/target. This new
> node could store either a p-nodeid or a v-nodeid, according to whether the
> interfaces we talked about above are placed inside Xen, or inside the
> guest OS.
> Do you have a better way to pass this requirement to the balloon driver,
> instead of creating a new xenstore node? I'd be very happy if you have one,
> since I don't like the way I have done it (creating a new node) either!

As said - I'd want you to evaluate a model without such a new node, and with instead the requirement placed on the balloon driver to balloon out pages evenly allocated across the guest's virtual nodes.

> You are exactly right again; this design is only for the Linux balloon driver.
> For Linux, the balloon driver can choose which page to balloon in/out, so we
> can associate the pages with a v-nodeid.
> For other kinds of architecture, please forgive me that I haven't thought
> that far...

The abstract model shouldn't take OS implementation details or policies into account; the implementation later of course can (and frequently will need to).

Jan
Li Yechen
2013-Aug-16 14:17 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.

Oh, so you need the experiments' results without this patch? I see. I'll do it and send the results.

> > You are exactly right again; this design is only for the Linux balloon driver.
> > For Linux, the balloon driver can choose which page to balloon in/out, so we
> > can associate the pages with a v-nodeid.
> > For other kinds of architecture, please forgive me that I haven't thought
> > that far...
>
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
>
> Jan

So, you mean that the abstract model should consider that the OS might not be able to allocate pages by virtual node IDs? That's a question... Let me think about it :-)

-- 
Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science, Peking University, China

Nothing is impossible because impossible itself says: "I'm possible"
lccycc From PKU
Jan Beulich
2013-Aug-16 14:55 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 16.08.13 at 16:17, Li Yechen <lccycc123@gmail.com> wrote:
> On Fri, Aug 16, 2013 at 9:21 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
>
> Oh, so you need the experiment results without this patch?
> I see. I'll do it and send the results.

What experiment?

>> > You are exactly right again, this design is only for the Linux balloon
>> > driver. For Linux, the balloon driver can choose which pages to balloon
>> > in/out, so we can associate the pages with a v-nodeid.
>> > For other kinds of architecture, please forgive me that I haven't
>> > thought that far...
>>
>> The abstract model shouldn't take OS implementation details or
>> policies into account; the implementation later of course can (and
>> frequently will need to).
>
> So, you mean that the abstract model should consider that the OS might
> not be able to allocate pages by virtual node IDs?

No. What I said is that associating ballooned out pages with a
particular vNID seems wrong. If the balloon driver gets back a fresh
page during re-population, it shouldn't depend on having a suitable
vacated page control structure available, but instead should be able to
absorb the page in any case.

But again, this all is taking Linux concepts into consideration, which
don't belong in the architectural model (or at most as an example, but
your examples started _after_ you already started dealing with Linux
specifics).

Jan
Dario Faggioli
2013-Aug-16 22:53 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
> >> And finally, coming back to what Tim had already pointed out - doing
> >> things the way you propose can cause an imbalance in the
> >> ballooned down guest, penalizing it in favor of not penalizing the
> >> intended consumer of the recovered memory. Therefore I wonder
> >> whether, without any new xenstore node, it wouldn't be better to
> >> simply require conforming balloon drivers to balloon out memory
> >> evenly across the domain's virtual nodes.
> >
> > I should say sorry here, but I don't quite understand the "whether" part.
> > The "new xenstore node" just stores the requirement from the user, so that
> > the balloon driver can read it. It's similar to ~/memory/target. This new
> > node could store either a p-nodeid or a v-nodeid, depending on whether the
> > interface we talked about above is placed inside Xen or inside the guest OS.
> > Do you have a better way to pass this requirement to the balloon driver,
> > instead of creating a new xenstore node? I'd be very happy if you have one,
> > since I don't like the way I have done it (creating a new node) either!
>
> As said - I'd want you to evaluate a model without such a new node,
> and with instead the requirement placed on the balloon driver to
> balloon out pages evenly allocated across the guest's virtual nodes.
>
Why not support both modes? I mean, Jan, I totally see what you mean,
and I agree: a very important use case is where the user just says
"balloon down/up", in which case reclaiming/populating evenly is the
most sane thing to do (as also said by David -- I think it was him
rather than Tim).

However, what about the use case where the user actually wants to make
space on a specific p-node, no matter what it will cost to the existing
guests?
I don't have that much "ballooning experience", so I am genuinely asking
here: is that use case completely irrelevant? I personally think it
would be something nice to have, although, again, probably not as the
default behaviour...

What about something like: the default is the even distribution, but if
the user makes it clear he wants a specific p-node (whatever v-node or
v-nodes that will mean for the guest), we also allow that?

For actually doing it, I'm not sure what the best interface would be...
The new xenstore key did not look perfect, but not that bad either.
FWIW, if we'd stick with it, I agree with you that it should host
v-nodes (so the hypercall doing the translation should happen in the
toolstack), and that it should host a mask.

> > You are exactly right again, this design is only for the Linux balloon
> > driver. For Linux, the balloon driver can choose which pages to balloon
> > in/out, so we can associate the pages with a v-nodeid.
> > For other kinds of architecture, please forgive me that I haven't
> > thought that far...
>
> The abstract model shouldn't take OS implementation details or
> policies into account; the implementation later of course can (and
> frequently will need to).
>
You are right. Although it is true that this series is specifically for
Linux, Linux specific concepts populate the design document too much, or
at least in the wrong places. :-)

That being said (and perhaps Yechen could make a note about this, so
that if he sends another version, he could reorganize this
patch/document a bit, to achieve a better separation between the generic
model description and the implementation details), if you consider all
this the description of the Linux specific implementation, it would be
interesting to know what you think of such specific implementation.
:-)

Thanks a lot for taking a look and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Dario Faggioli
2013-Aug-16 23:30 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> > +The biggest difference between current and NUMA-aware ballooning is that the
> > +latter needs to keep multiple lists of the ballooned pages in an array, with
> > +one element for each virtual node. This way, it is always evident, at any
> > +given time, what ballooned pages belong to what vnode.
>
> That's wrong afaict: ballooned out pages aren't associated with any
> memory, and hence can't be associated with any vNID. Once they
> get re-populated, which vNID the memory belongs to is an attribute
> of the memory coming in, not the control structure that it's to be
> associated with.
>
I may be wrong (I'm sorry, I had very few chances to look at the
ballooning code, and won't be able to do so for a while), but I think
what we want here is the other way around, i.e., having a way to make
sure that the memory that will come in will also end up --in the guest--
within a specific v-node.

I don't know if the only/best way to do this is the array of lists in
Yechen's patches, and I agree (as per the other e-mail) that this is
more an implementation detail than anything else, but I think the point
here is: do we want to support that operational mode (again, perhaps not
as the default mode, even in a virtual NUMA enabled guest)?

> I believe this thinking of yours stems from the fact that in Linux the
> page control structures are associated with nodes by way of the
> physical memory map being split into larger pieces, each coming from
> a particular node. But other OSes don't need to follow this model,
> and what you propose would also exclude extending the spanned
> nodes set if memory gets ballooned in that's not associated with
> any node the domain so far was "knowing" of.
>
I agree on the first part of this comment... Too much Linux-ism in the
description of what should be a generic model.
The second part (the one about what happens if memory comes from an
"unknown" node) I'm not sure I get.

Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
and that is because, at creation time, either the user or the toolstack
decided this was the way to go.

So, if page 2 was ballooned down, when ballooning it up, we would like
to retain the fact that it is backed by a frame in p-node 2, and we
could ask Xen to try to make that happen. On failure (e.g., no free
frames on p-node 2), we could either fail or have Xen allocate the
memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
it (i.e., map G's page 2 there), which I think is what you mean with
<<node the domain so far was "knowing" of>>, isn't it?

Or was it something different that you were asking?

Thanks and Regards,
Dario
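The fallback behaviour Dario describes above could look roughly like this (a hypothetical sketch; `alloc_backing_node` and `free_pages` are invented names standing in for Xen's real per-node heap accounting):

```python
def alloc_backing_node(preferred, free_pages, node_exact=False):
    """Pick a p-node to back a re-populated page.

    preferred:  p-node the page was backed by before ballooning down
    free_pages: dict mapping p-node id -> free page count
    node_exact: if True, fail rather than allocate off-node
    """
    if free_pages.get(preferred, 0) > 0:
        return preferred
    if node_exact:
        return None  # caller reports failure to reach the target
    # Fall back to the p-node with the most free memory; the guest may
    # then see memory on a node it was not previously "knowing" of.
    fallback = max(free_pages, key=free_pages.get)
    return fallback if free_pages[fallback] > 0 else None
```

In the example above, re-populating G's page 2 when p-node 2 is exhausted would return some other node (or `None` under a strict "node exact" policy).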
Jan Beulich
2013-Aug-19 09:17 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
> On ven, 2013-08-16 at 10:09 +0100, Jan Beulich wrote:
>> I believe this thinking of yours stems from the fact that in Linux the
>> page control structures are associated with nodes by way of the
>> physical memory map being split into larger pieces, each coming from
>> a particular node. But other OSes don't need to follow this model,
>> and what you propose would also exclude extending the spanned
>> nodes set if memory gets ballooned in that's not associated with
>> any node the domain so far was "knowing" of.
>>
> I agree on the first part of this comment... Too much Linux-ism in the
> description of what should be a generic model.
>
> The second part (the one about what happens if memory comes from an
> "unknown" node) I'm not sure I get.
>
> Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
> pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
> v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
> and that is because, at creation time, either the user or the toolstack
> decided this was the way to go.
> So, if page 2 was ballooned down, when ballooning it up, we would like
> to retain the fact that it is backed by a frame in p-node 2, and we
> could ask Xen to try to make that happen. On failure (e.g., no free
> frames on p-node 2), we could either fail or have Xen allocate the
> memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
> it (i.e., map G's page 2 there), which I think is what you mean with
> <<node the domain so far was "knowing" of>>, isn't it?

Right. Or the guest could choose to create a new node on the fly.

Jan
Jan Beulich
2013-Aug-19 09:22 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 12:53 AM >>>
> On ven, 2013-08-16 at 14:21 +0100, Jan Beulich wrote:
>> >>> On 16.08.13 at 12:12, Li Yechen <lccycc123@gmail.com> wrote:
>> > On Fri, Aug 16, 2013 at 5:09 PM, Jan Beulich <JBeulich@suse.com> wrote:
>> >> And finally, coming back to what Tim had already pointed out - doing
>> >> things the way you propose can cause an imbalance in the
>> >> ballooned down guest, penalizing it in favor of not penalizing the
>> >> intended consumer of the recovered memory. Therefore I wonder
>> >> whether, without any new xenstore node, it wouldn't be better to
>> >> simply require conforming balloon drivers to balloon out memory
>> >> evenly across the domain's virtual nodes.
>> >
>> > I should say sorry here, but I don't quite understand the "whether" part.
>> > The "new xenstore node" just stores the requirement from the user, so that
>> > the balloon driver can read it. It's similar to ~/memory/target. This new
>> > node could store either a p-nodeid or a v-nodeid, depending on whether the
>> > interface we talked about above is placed inside Xen or inside the guest OS.
>> > Do you have a better way to pass this requirement to the balloon driver,
>> > instead of creating a new xenstore node? I'd be very happy if you have one,
>> > since I don't like the way I have done it (creating a new node) either!
>>
>> As said - I'd want you to evaluate a model without such a new node,
>> and with instead the requirement placed on the balloon driver to
>> balloon out pages evenly allocated across the guest's virtual nodes.
>>
> Why not support both modes? I mean, Jan, I totally see what you mean,
> and I agree: a very important use case is where the user just says
> "balloon down/up", in which case reclaiming/populating evenly is the
> most sane thing to do (as also said by David -- I think it was him
> rather than Tim).
>
> However, what about the use case where the user actually wants to make
> space on a specific p-node, no matter what it will cost to the existing
> guests? I don't have that much "ballooning experience", so I am
> genuinely asking here: is that use case completely irrelevant? I
> personally think it would be something nice to have, although, again,
> probably not as the default behaviour...
>
> What about something like: the default is the even distribution, but if
> the user makes it clear he wants a specific p-node (whatever v-node or
> v-nodes that will mean for the guest), we also allow that?

Yes, supporting both modes is certainly desirable; I merely tried to
point out that the description went too far in the direction of
advocating only the enforce-a-node model (almost like this was the only
sensible one from the NUMA perspective). Penalizing another guest should
clearly not be a default action, but it may validly be a choice by the
administrator.

Jan
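A sketch of how the two modes could coexist, with even spreading as the default and node-targeted ballooning as an explicit administrator choice (illustrative only; `plan_balloon` and its parameters are invented names, not part of the posted patches):

```python
def plan_balloon(delta_pages, vnodes, target_vnodes=None):
    """Return a {vnode: pages} plan for ballooning out delta_pages.

    target_vnodes: optional list of vnodes the administrator singled out
    (e.g. those mapping to a p-node that must be vacated); when absent,
    spread evenly so no single vnode is penalized.
    """
    victims = target_vnodes if target_vnodes else vnodes
    base, rem = divmod(delta_pages, len(victims))
    # Distribute the remainder one page at a time over the first vnodes.
    return {v: base + (1 if i < rem else 0) for i, v in enumerate(victims)}
```

The default path penalizes nobody; passing `target_vnodes` concentrates the pressure, which is exactly the administrator-only choice discussed above.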
George Dunlap
2013-Aug-19 11:05 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
> > +So, in NUMA aware ballooning, ballooning down and up works as follows:
> > +
> > +* target < current usage -- first of all, the ballooning driver uses the
> > +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
> > +  as explained above) to translate _pnid_ (that it reads from xenstore) to
> > +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>
> This looks conceptually wrong: The balloon driver should have no
> need to know about pNID-s; it should be the tool stack doing the
> translation prior to writing the xenstore node.

I agree with this -- I would like to point out that to make this work
for ballooning *up*, however, there will need to be a way for the guest
to specify, "please allocate from vnode X", and have Xen translate the
vnode into the appropriate pnode(s).

 -George
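The translation George describes amounts to a lookup in the vnode-to-pnode mapping fixed at domain-build time. A toy sketch (the `v2p` table and helper name are invented; in reality only Xen and the toolstack hold this mapping, never the guest):

```python
# Hypothetical vnode -> set-of-pnodes mapping, decided at domain build
# time (matching the running example: v-node 0 on p-node 2, v-node 1 on
# p-node 4).
v2p = {0: {2}, 1: {4}}

def vnode_to_pnode_mask(vnid):
    """Resolve a guest vnode id to the physical node(s) backing it."""
    return v2p.get(vnid, set())

# A balloon-up request for vnode 0 becomes "allocate on p-node 2":
assert vnode_to_pnode_mask(0) == {2}
```

The guest only ever names vnodes; the resolution to pnodes happens on the Xen/toolstack side, so the physical topology stays hidden.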
David Vrabel
2013-Aug-19 12:58 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On 16/08/13 05:13, Yechen Li wrote:
>
> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
> +
> +This service is provided by the hypervisor (and wired, if necessary, all the
> +way up to the proper toolstack layer or guest kernel), since it is only Xen
> +that knows both the virtual and the physical topologies.

The physical NUMA topology must not be exposed to guests that have a
virtual NUMA topology -- only the toolstack and Xen should know the
mapping between the two.

A guest cannot make sensible use of a machine topology as it may be
migrated to a host with a different topology.

> +## Description of the problem ##
> +
> +Let us use an example. Let's assume that guest _G_ has 2 virtual vnodes,
> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
> +respectively.
> +
> +Now, the user wants to create a new guest, but the system is under high memory
> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
> +the best chances to accommodate all the memory for the new guest, which would
> +be really good for performance, if only he can make space there. _G_ is the
> +only domain eating some memory from pnode #2 but, as said above, not all of
> +its memory comes from there.

It is not clear to me that this is the optimal decision. What
tools/information will be available that the user can use to make
sensible decisions here? e.g., is the current layout available to tools?

Remember that the "user" in this example is most often some automated
process and not a human.

> +So, right now, the user has no way to specify that he wants to balloon down
> +_G_ in such a way that he will get as many free pages as possible from pnode
> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
> +no guarantee on from what pnode the memory will be freed.
> +
> +The same applies to the ballooning up case, when the user, for some specific
> +reasons, wants to be sure that it is the memory of some (other) specific pnode
> +that will be used.

I would like to see some real world examples of cases where this is
sensible.

In general, I'm not keen on adding ABIs or interfaces that don't solve
real world problems, particularly if they're easy to misuse and end up
with something that is very suboptimal.

> +## NUMA-aware ballooning ##
> +
> +The new NUMA-aware ballooning logic works as follows.
> +
> +There is room, in libxl\_set\_memory\_target(), for two more parameters, in
> +addition to the new memory target:

The Xenstore interface should be the primary interface being documented.
The libxl interface is secondary and (probably) a consequence of the
xenstore interface.

> +* _pnid_ -- which is the pnode id of the node the user wants to try to get
> +  some free memory on
> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
> +  possible to reach the new ballooning target only with memory from pnode
> +  _pnid_, the user is fine with using memory from other pnodes.
> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
> +  it is false, the new target will (probably) be reached, but it is possible
> +  that some memory is freed on pnodes other than _pnid_.
> +
> +To let the ballooning driver know about these new parameters, a new xenstore
> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
> +operation to occur, the user should write the proper values in both keys:
> +~/memory/target\_nid and ~/memory/target.

If we decide we do need such control, I think the xenstore interface
should look more like:

    memory/target

        as before

    memory/target-by-nid/0

        target for virtual node 0

    ...

    memory/target-by-nid/N

        target for virtual node N

I think this better reflects the goal, which is an adjusted NUMA layout
for the guest, rather than the steps required to reach it (release P
pages from node N).

The balloon driver attempts to reach target, whilst simultaneously
trying to reach the individual node targets. It should prefer to balloon
up/down on the node that is furthest from its node target.

In cases where target and the sum of target-by-nid/N don't agree (or are
not present) the balloon driver should use target, and balloon up/down
evenly across all NUMA nodes.

The libxl interface does not necessarily have to match the xenstore
interface if that's what the initial tools would prefer.

Finally, a style comment: please avoid the use of single gender-specific
pronouns in documentation/comments (i.e., don't always use he/his etc.).
I prefer to use a singular "they", but you could consider "he or she" or
using "he" for some examples and "she" in others.

David
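The selection rule in David's proposal (prefer the node furthest from its per-node target) could be sketched like this (illustrative only; `current` and `targets` stand for per-vnode page counts, the latter as would be read from the proposed memory/target-by-nid/N keys):

```python
def next_balloon_node(current, targets):
    """Pick the vnode whose population is furthest above its target.

    current, targets: dicts mapping vnode id -> page counts.
    Returns None once every node is at or below its target.
    """
    over = {v: current[v] - targets[v]
            for v in current if current[v] > targets[v]}
    # Balloon out on the most-over-target node first.
    return max(over, key=over.get) if over else None
```

A real driver would loop: pick a node, release a batch of pages from it, re-read the keys, and repeat until all per-node targets (and the global target) are met.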
George Dunlap
2013-Aug-19 13:26 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.
>
>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has 2 virtual vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode #2 but, as said above, not all of
>> +its memory comes from there.
>
> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.
>
>> +So, right now, the user has no way to specify that he wants to balloon down
>> +_G_ in such a way that he will get as many free pages as possible from pnode
>> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
>> +no guarantee on from what pnode the memory will be freed.
>> +
>> +The same applies to the ballooning up case, when the user, for some specific
>> +reasons, wants to be sure that it is the memory of some (other) specific pnode
>> +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

I think at the very least the guest needs to be able to say, "allocate
me a page from vnode X", and have Xen translate that into a pnode
internally, so that ballooning down and back up again doesn't destroy a
guest's NUMA memory affinity (e.g., the vnode->pnode memory mapping).

[snip]

>> +* _pnid_ -- which is the pnode id of the node the user wants to try to get
>> +  some free memory on
>> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
>> +  possible to reach the new ballooning target only with memory from pnode
>> +  _pnid_, the user is fine with using memory from other pnodes.
>> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
>> +  it is false, the new target will (probably) be reached, but it is possible
>> +  that some memory is freed on pnodes other than _pnid_.
>> +
>> +To let the ballooning driver know about these new parameters, a new xenstore
>> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
>> +operation to occur, the user should write the proper values in both keys:
>> +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
>     memory/target
>
>         as before
>
>     memory/target-by-nid/0
>
>         target for virtual node 0
>
>     ...
>
>     memory/target-by-nid/N
>
>         target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).

This seems more sensible than a mask (as Jan suggested); but is it open
to race conditions?

> The balloon driver attempts to reach target, whilst simultaneously
> trying to reach the individual node targets. It should prefer to
> balloon up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or
> are not present) the balloon driver should use target, and balloon
> up/down evenly across all NUMA nodes.
>
> The libxl interface does not necessarily have to match the xenstore
> interface if that's what the initial tools would prefer.
>
> Finally, a style comment: please avoid the use of single gender-specific
> pronouns in documentation/comments (i.e., don't always use he/his etc.).
> I prefer to use a singular "they", but you could consider "he or she" or
> using "he" for some examples and "she" in others.

Doing half and half seems a bit strange to me; if we're trying for
gender equity, I'd just go for "she" all the way. There are enough "he"s
in the wider literature to more than balance it out for many years to
come. :-)

 -George
Dario Faggioli
2013-Aug-20 14:05 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
[For this message and for all the others I'm sending today, sorry for
the webmail :-P]

From: Jan Beulich [jbeulich@suse.com]
> >>>> Dario Faggioli <dario.faggioli@citrix.com> 08/17/13 1:31 AM >>>
> > Suppose we have guest G with 2 v-nodes, where pages in v-node 0 (say,
> > pages 0,1,2..N-1) are backed by frames on p-node 2, while pages in
> > v-node 1 (say, pages N,N+1,N+2..2N-1) are backed by frames on p-node 4,
> > and that is because, at creation time, either the user or the toolstack
> > decided this was the way to go.
> > So, if page 2 was ballooned down, when ballooning it up, we would like
> > to retain the fact that it is backed by a frame in p-node 2, and we
> > could ask Xen to try to make that happen. On failure (e.g., no free
> > frames on p-node 2), we could either fail or have Xen allocate the
> > memory somewhere else, i.e., not on p-node 2 or p-node 4, and live with
> > it (i.e., map G's page 2 there), which I think is what you mean with
> > <<node the domain so far was "knowing" of>>, isn't it?
>
> Right. Or the guest could choose to create a new node on the fly.
>
> Jan
>
Are you talking of a guest creating a new virtual node, i.e., changing
its own (virtual) NUMA topology on the fly? If yes, that could be an
option too, I guess.

It is not something we plan to support in the first implementation of
virtual NUMA for Linux, but since we are talking about making this spec
document more general, yes, I think we should not rule out such a
possibility.

Thanks and Regards,
Dario
Dario Faggioli
2013-Aug-20 14:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: Jan Beulich [jbeulich@suse.com]
>
> Yes, supporting both modes is certainly desirable; I merely tried to
> point out that the description went too far in the direction of
> advocating only the enforce-a-node model (almost like this was the
> only sensible one from the NUMA perspective). Penalizing another guest
> should clearly not be a default action, but it may validly be a choice
> by the administrator.
>
> Jan
>
Ok, cool to know; the point you're trying to make is a really valuable
one, and I agree with it. :-)

So, Yechen, in case you're up for another version of this series, here's
what I would recommend you to think about:

* take this design document and reorganize it a bit, in order to
  separate the generic description of the NUMA-aware ballooning concept,
  functioning, interface, etc., from the details of the Linux
  implementation. I think those details could still live here
  (especially at this stage), but they should be in their own separate
  section(s), as an "example implementation"
* given the issues/doubts about the new interface, could we have a first
  version of the code _without_ the new xenstore key that just does
  something as follows:
  - the user asks to balloon up/down by N pages
  - if the guest is a virtual NUMA enabled guest with Y virtual nodes,
    the ballooning driver gives/takes N/Y pages to/from each node

Of course, when looking at/implementing the latter point, do not throw
this code here away (I'm talking about the version you submitted, with
the new xenstore key)... Or at least, I think it is a valuable addition,
so if I were you, I wouldn't throw it away.

I think it would be entirely possible to reach consensus on the "evenly
spreading without new xenstore key" version, as a first step, and
upstream it. Afterwards, we will enhance the interface (if we decide we
like it) with the per-node targets.

What do you think? I'm of course up for helping you with that, but not
before a couple of weeks.
:-P

Thanks and Regards,
Dario
David Vrabel
2013-Aug-20 14:20 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On 19/08/13 14:26, George Dunlap wrote:
> On Mon, Aug 19, 2013 at 1:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
>> If we decide we do need such control, I think the xenstore interface
>> should look more like:
>>
>>     memory/target
>>
>>         as before
>>
>>     memory/target-by-nid/0
>>
>>         target for virtual node 0
>>
>>     ...
>>
>>     memory/target-by-nid/N
>>
>>         target for virtual node N
>>
>> I think this better reflects the goal, which is an adjusted NUMA layout
>> for the guest, rather than the steps required to reach it (release P
>> pages from node N).
>
> This seems more sensible than a mask (as Jan suggested); but is it
> open to race conditions?

xenstore transactions could be used to make the reads/writes of the set
of values atomic.

David
Jan Beulich
2013-Aug-20 14:24 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
>>> On 20.08.13 at 16:05, Dario Faggioli <dario.faggioli@citrix.com> wrote:
> Are you talking of a guest creating a new virtual node, i.e., changing
> its own (virtual) NUMA topology on the fly? If yes, that could be an
> option too, I guess.

Yes, albeit not necessarily in a way visible to the hypervisor (i.e. the
guest may do this just for its own purposes).

Jan
Dario Faggioli
2013-Aug-20 14:31 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: dunlapg@gmail.com [dunlapg@gmail.com] on behalf of George Dunlap
[George.Dunlap@eu.citrix.com]
> On Fri, Aug 16, 2013 at 10:09 AM, Jan Beulich <JBeulich@suse.com> wrote:
>> >>> On 16.08.13 at 06:13, Yechen Li <lccycc123@gmail.com> wrote:
>> > +So, in NUMA aware ballooning, ballooning down and up works as follows:
>> > +
>> > +* target < current usage -- first of all, the ballooning driver uses the
>> > +  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
>> > +  as explained above) to translate _pnid_ (that it reads from xenstore) to
>> > +  the id(s) of the corresponding set of vnode IDs, say _{vnids}_ (which will
>>
>> This looks conceptually wrong: The balloon driver should have no
>> need to know about pNID-s; it should be the tool stack doing the
>> translation prior to writing the xenstore node.
>
> I agree with this
>
Well, if we're talking about the principle of separating host and guest
(real and virtual) NUMA topology, I not only agree, I am one of its
strongest advocates (as George and Elena can testify! :-P)

What we're talking about here is some sort of translation service to
call every time some kind of hint about the relationship between virtual
and real is necessary, to perform a specific operation better. IOW, the
guest would not be allowed to store the result of this call and use it
in the future (and expect it to be accurate). Perhaps that was not
stated clearly enough in the description.

I'd be very happy to get rid of this too, but George himself points out
here below why this (or something really similar to this) is necessary.

> -- I would like to point out that to make this work
> for ballooning *up*, however, there will need to be a way for the
> guest to specify, "please allocate from vnode X", and have Xen
> translate the vnode into the appropriate pnode(s).
>
Exactly.
And since we already have all that is needed in place to tell Xen something
like "allocate from this p-node", Yechen went for this two step approach:
1) translate, 2) ask Xen to allocate. This did not require any change at the
hypervisor level, so we thought it could be nice. :-)

Of course, if it's considered better to modify Xen to perform the whole
operation, fine, this is for sure something that can be done, and that does
not have much effect on the rest of the design and of the implementation,
right Yechen?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
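[Editorial note: the two-step balloon-up path discussed above (translate the
virtual node, then use the existing "allocate from this p-node" interface)
can be sketched roughly as below. All names, the mapping, and the allocation
callback are hypothetical illustrations, not actual driver code.]

```python
# Example virtual-to-physical node mapping; in reality this would come
# from the (proposed) VNODE_TO_PNODE translation service, not a dict.
VNODE_TO_PNODE = {0: 0, 1: 2}

def balloon_up_on_vnode(vnid, nr_pages, allocate):
    """Step 1: translate vnid to a pnode; step 2: ask for pages there."""
    pnid = VNODE_TO_PNODE[vnid]      # translation service (hypothetical)
    return allocate(pnid, nr_pages)  # existing "allocate from pnode" call

# A stand-in for the hypervisor allocation call, for illustration.
def fake_allocate(pnid, nr_pages):
    return [(pnid, i) for i in range(nr_pages)]

pages = balloon_up_on_vnode(1, 3, fake_allocate)
# All three pages come from pnode 2, the node backing vnode 1.
```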
Dario Faggioli
2013-Aug-20 14:55 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
From: David Vrabel

> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

See the other e-mail (me replying to George about something like that being
necessary for ballooning up, although, yes, it probably can happen all in
Xen, if that's considered better).

>> +## Description of the problem ##
>> +
>> +Let us use an example. Let's assume that guest _G_ has 2 vnodes,
>> +and that the memory for vnode #0 and #1 comes from pnode #0 and pnode #2,
>> +respectively.
>> +
>> +Now, the user wants to create a new guest, but the system is under high memory
>> +pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
>> +the best chances to accommodate all the memory for the new guest, which would
>> +be really good for performance, if only he can make space there. _G_ is the
>> +only domain eating some memory from pnode #2 but, as said above, not all of
>> +its memory comes from there.
>
> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?

Well, the whole "free pages from pnode #2" thing is more a tool than a
decision. It's a tool that will become available to better enact decisions
made at some upper level (i.e., admin, or toolstack).
The current layout of how much memory is occupied on what node by each guest
is definitely something we should have in place (even independently from
this feature/series, I think). It's already available via a Xen debug key,
so it's just a matter of wiring it up (I think). I'll give it a try as soon
as I'm back to work.

> Remember that the "user" in this example is most often some automated
> process and not a human.

Exactly. :-D

> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

I see what you mean, and certainly I don't disagree. It's a bit of a
chicken-&-egg problem, since I can't find real examples of something that
does not exist, but yes, I think we can investigate a bit more whether or
not something like this would be useful.

The reason I think it is, is that we have an automatic initial placement
algorithm that, every time we create a VM, tries to find the smallest set of
nodes to place it on, and I think it would be nice to give the admin (or
some advanced toolstack) all the tools to maximize the probability of such
an algorithm finding a suitable and performance-friendly solution... Right
now the only such tool is "kill or migrate some VM somewhere else", which is
not that much... :-P

> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).

Oh, cool, I really like this.
Yechen, what do you think?

> The balloon driver attempts to reach target, whilst simultaneously trying
> to reach the individual node targets. It should prefer to balloon
> up/down on the node that is furthest from its node target.

And this is an interesting idea too.

> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.

And that would be fine too... As I said in another e-mail, what I propose to
Yechen is to start dealing with this latter case, i.e., get rid of the new
controls (or pretend they're not there) and implement the even distribution
of ballooned pages across virtual NUMA nodes. After that, we will move to a
more advanced interface, if we deem it worthwhile.

> Finally a style comment, please avoid the use of single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.). I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.

Good point. Personally, I think I prefer the "they" form, but I'm fine with
both.

Thanks and Regards,
Dario
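[Editorial note: David's proposed driver behaviour can be sketched as below.
This is a toy illustration of the assumed semantics (prefer the node
furthest from its per-node target; fall back to an even spread when the
per-node targets disagree with the global one), not real balloon-driver
code, and the function name is made up.]

```python
def pick_node(current, target_by_nid, target):
    """Return the vnode to balloon next, or None for the even fallback.

    current       -- pages currently held per vnode
    target_by_nid -- per-node targets (memory/target-by-nid/N)
    target        -- global target (memory/target)
    """
    if sum(target_by_nid) != target:
        # Targets disagree: ignore per-node targets, spread evenly.
        return None
    # Distance from target; positive means balloon down on that node.
    deltas = [cur - tgt for cur, tgt in zip(current, target_by_nid)]
    nid = max(range(len(deltas)), key=lambda n: abs(deltas[n]))
    return nid if deltas[nid] != 0 else None

# vnode 1 is 100 pages over its target, vnode 0 only 20: pick vnode 1.
```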
Li Yechen
2013-Aug-20 15:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi David,

On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:

> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.
>
> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Most of you share the opinion that the interface should be in Xen, not in
the guest balloon driver. I agree with it. In the next version I'll think
about how to implement this interface between Xen and the balloon driver.

> It is not clear to me that this is the optimal decision. What
> tools/information will be available that the user can use to make
> sensible decisions here? e.g., is the current layout available to tools?
>
> Remember that the "user" in this example is most often some automated
> process and not a human.

We'd like to have a libxl tool for the user, or for an automated process, to
change the node affinity of a guest. The decision is made by whoever calls
this libxl tool.

>> +So, right now, the user has no way to specify that he wants to balloon down
>> +_G_ in such a way that he will get as much as possible free pages from pnode
>> +#2, rather than from pnode #0. He can ask _G_ to balloon down, but there is
>> +no guarantee on from what pnode the memory will be freed.
>> +
>> +The same applies to the ballooning up case, when the user, for some specific
>> +reasons, wants to be sure that it is the memory of some (other) specific pnode
>> +that will be used.
>
> I would like to see some real world examples of cases where this is
> sensible.
>
> In general, I'm not keen on adding ABIs or interfaces that don't solve
> real world problems, particularly if they're easy to misuse and end up
> with something that is very suboptimal.

Dario, could the test examples that you sent to me several months ago be
presented as a real-world example?
The example shows that, after several guests are created and shut down, the
node affinity is a mess.

> The Xenstore interface should be the primary interface being documented.
> The libxl interface is secondary and (probably) a consequence of the
> xenstore interface.
>
>> +* _pnid_ -- which is the pnode id of which node the user wants to try get some
>> +  free memory on
>> +* _nodeexact_ -- which is a bool specifying whether or not, in case it is not
>> +  possible to reach the new ballooning target only with memory from pnode
>> +  _pnid_, the user is fine with using memory from other pnodes.
>> +  If _nodeexact_ is true, it is possible that the new target is not reached; if
>> +  it is false, the new target will (probably) be reached, but it is possible
>> +  that some memory is freed on pnodes other than _pnid_.
>> +
>> +To let the ballooning driver know about these new parameters, a new xenstore
>> +key exists in ~/memory/target\_nid. So, for a proper NUMA aware ballooning
>> +operation to occur, the user should write the proper values in both the keys:
>> +~/memory/target\_nid and ~/memory/target.
>
> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal, which is an adjusted NUMA layout
> for the guest, rather than the steps required to reach it (release P
> pages from node N).
>
> The balloon driver attempts to reach target, whilst simultaneously trying
> to reach the individual node targets. It should prefer to balloon
> up/down on the node that is furthest from its node target.
>
> In cases where target and the sum of target-by-nid/N don't agree (or are
> not present) the balloon driver should use target, and balloon up/down
> evenly across all NUMA nodes.
> The libxl interface does not necessarily have to match the xenstore
> interface, if that's what the initial tools would prefer.
>
> Finally a style comment, please avoid the use of single gender
> specific pronouns in documentation/comments (i.e., don't always use
> he/his etc.). I prefer to use a singular "they" but you could consider
> "he or she" or using "he" for some examples and "she" in others.
>
> David

Oh, I think this is a better interface! I'd appreciate this much more than
what I have now. However, the code I show here _does_not_really_work_. It
just passes some small tests, and I'm afraid that there may be some bugs in
my code. Could I set up David's interface as a secondary goal, waiting until
this code is fully tested? I'm really not that confident :)

--

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Konrad Rzeszutek Wilk
2013-Aug-23 20:53 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
> On 16/08/13 05:13, Yechen Li wrote:
>>
>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>> +
>> +This service is provided by the hypervisor (and wired, if necessary, all the
>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>> +that knows both the virtual and the physical topologies.
>
> The physical NUMA topology must not be exposed to guests that have a
> virtual NUMA topology -- only the toolstack and Xen should know the
> mapping between the two.

I think exposing any NUMA topology to a guest - regardless of whether it is
based on real NUMA or not - is OK, and actually a pretty neat thing.

Meaning you could tell a PV guest that it is running on a 16 socket NUMA box
while in reality it is running on a single socket box. Or vice-versa. It can
serve as a way to increase performance (or decrease it) - and also do
resource capping (this PV guest will only get 1GB of real fast memory and
then 7GB of slow memory) and let the OS handle the details of it (which it
does nowadays).

The mapping though - of which PV pages should belong to which fake PV NUMA
node, and how they bind to the real NUMA topology - that part I am not sure
how to solve. More on this later.

> A guest cannot make sensible use of a machine topology as it may be
> migrated to a host with a different topology.

Correct. And that is OK - it just means that the performance can suck
horribly while it is there. Or the guest can be migrated to an even better
NUMA machine where it will perform even better. That is nothing new, and it
is no different whether you have PV NUMA in a guest or not.

>> +## Description of the problem ##

I think you have to back this up with the problem description. That is, you
need to think of:
 - How a PV guest will allocate pages at bootup based on this
 - How it will balloon up/down within those "buckets".
If you are using the guest's NUMA hints, they usually come in the form of
'allocate pages on this node', and the node information is of the type 'pfn
X to pfn Y are on this NUMA node'. That does not work very well with
ballooning, as the ballooned pages can be scattered across various nodes.
But that is mostly b/c the balloon driver is not even trying to use the NUMA
APIs. It could use them, and then it would do the best it can and perhaps
balloon round-robin across the NUMA pools. Or perhaps a better option would
be to use the memory hotplug mechanism (which is implemented in the balloon
driver) and do large swaths of memory.

But more problematic is migration. If you migrate a guest to a host that has
a different NUMA topology, what you really really want is:
 - unplug all of the memory in the guest
 - replug the memory with the new NUMA topology

Obviously this means you need some dynamic NUMA system - and I don't know of
such. The unplug/plug can be done via the balloon driver and/or the memory
hotplug system. But then - the boundaries of the NUMA pools are set at boot
time, and you would want to change them. Are SRAT/SLIT dynamic? Could they
change during runtime?

Then there is the concept of AutoNUMA, where you would migrate pages from
one node to another. With a PV guest that would imply that the hypervisor
would poke the guest and say: "OK, time to alter your P2M table". Which I
guess right now is done best via the balloon driver - so what you really
want is a callback to tell the balloon driver: hey, balloon down and up this
PFN block on NUMA node X.

Perhaps what could be done is to set up, in the cluster of hosts, the worst
case NUMA topology and force it on all the guests. Then, when migrating, the
"pools" can be filled/unfilled depending on which host the guest is on - and
whether it can fill up the NUMA pools properly. For example, it migrates
from a 1 node box to a 16 node box and all the memory is remote.
It will empty out the PV NUMA pool of the "closest" memory to zero and fill
up the PV NUMA pool of the "farthest" with all the memory, to balance it out
and have some real sense of the PV to machine host memory mapping.
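[Editorial note: the round-robin idea mentioned above - spreading a balloon
request across the guest's NUMA pools instead of draining whichever node the
allocator hits first - can be sketched as below. This is a toy illustration
only; the function name is made up and it has no relation to actual balloon
driver internals.]

```python
from itertools import cycle

def round_robin_balloon(nr_pages, nodes):
    """Distribute nr_pages release requests cyclically over nodes."""
    taken = {n: 0 for n in nodes}
    for _, n in zip(range(nr_pages), cycle(nodes)):
        taken[n] += 1
    return taken

# Releasing 10 pages over 3 nodes gives a 4/3/3 split, instead of
# taking all 10 from a single node.
```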
Dario Faggioli
2013-Aug-25 21:18 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On ven, 2013-08-23 at 13:53 -0700, Konrad Rzeszutek Wilk wrote:
> On Mon, Aug 19, 2013 at 01:58:51PM +0100, David Vrabel wrote:
>> On 16/08/13 05:13, Yechen Li wrote:
>>>
>>> +### nodemask VNODE\_TO\_PNODE(int vnode) ###
>>> +
>>> +This service is provided by the hypervisor (and wired, if necessary, all the
>>> +way up to the proper toolstack layer or guest kernel), since it is only Xen
>>> +that knows both the virtual and the physical topologies.
>>
>> The physical NUMA topology must not be exposed to guests that have a
>> virtual NUMA topology -- only the toolstack and Xen should know the
>> mapping between the two.
>
> I think exposing any NUMA topology to guest - regardless of whether it is
> based on real NUMA or not, is OK - and actually a pretty neat thing.

Yes, that is exactly how Elena, who is doing such work for PV guests, is
doing it.

> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).

Yes, exactly... Again. :-)

> The mapping though - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.

That is fine too. Again, Elena is working on both how to build up a virtual
topology and how to somehow map it to the real topology, for the sake of
performance. However, this series is about NUMA-aware ballooning, which is
something that makes sense _ONLY_ after we'll have all that virtual NUMA
thing in place.
That being said, I told Yechen that submitting what he already had as an RFC
could have been helpful anyway, i.e., he could get some comments on the
design, the approach, the interface, etc., which is actually what has
happened. :-) He should have been more clear about the fact that some
preliminary work was missing during the first submission. During the second
submission, I tried to help him make that more clear... If it still did not
work, and generated confusion instead, I am sorry about that.

About the technical part of this comment (guest knowledge of the real NUMA
topology), as I said already, I'm fine with leaving the guest completely in
the dark, if it's possible to provide a suitable interface between Xen and
the guest that will allow ballooning up to work (as George pointed out in
his e-mails).

>>> +## Description of the problem ##
>
> I think you have to back this up with the problem description. That is you
> need to think of:
>  - How a PV guest will allocate pages at bootup based on this

That's not this series' job...

>  - How it will balloon up/down within those "buckets".

That, I'm not sure I got (more below)...

> If you are using the guest's NUMA hints, they usually come in the form of
> 'allocate pages on this node', and the node information is of the type
> 'pfn X to pfn Y are on this NUMA node'. That does not work very well with
> ballooning as it can be scattered across various nodes. But that
> is mostly b/c the balloon driver is not even trying to use the NUMA
> APIs.

Indeed. The whole point is this.
"If it has been somehow established, at boot time, that pfn X is from virtual NUMA node 2, and that all the pfn-s from virtual node 2 are allocated --on the host-- on hardware NUMA node 0, let''s, when ballooning pfn X down and then ballooning it back up, make sure that: 1) in the guest it still belongs to virtual node 2, and 2) on the host is still backed by a page on hardware node 0" Does that make sense?> It could use it and then it would do the best it can and > perhaps balloon round-robin across the NUMA pools. >Exactly, that is what David suggested and what I also think it would be a nice first step (without any need of adding xenstore keys).> Or > perhaps a better option would be to use the hotplug memory mechanism > (which is implemented in the balloon driver) and do large swaths of > memory. >Mmm... I think it should all be possible without bothering with memory hotplug, but I may be wrong (I don''t really know much about memory hotplug).> But more problematic is the migration. If you migrate a guest > to node that has different NUMA topologies what you really really > want is: > - unplug all of the memory in the guest > - replug the memory with the new NUMA topology > > Obviously this means you need some dynamic NUMA system - and I don''t > know of such. >We don''t plan to support dynamically varying virtual NUMA topologies in the short term future. :-)> The unplug/plug can be done via the balloon driver > and or hotplug memory system. But then - the boundaries of the NUMA > pools is set a bootup time. And you would want to change them. > Is SRAT/SLIT dynamic? Could it change during runtime? >I don''t know if the real hw tables can actually change, but again, support for varying the virtual topology is not a priority right now.> Then there is the concept of AutoNUMA were you would migrate > pages from one node to another. With a PV guest that would imply > that the hypervisor would poke the guest and say: "ok, time > alter your P2M table". 
Yes, and that's what I have been working on for a while. It's particularly
tricky for a PV guest and, although very similar in principle, it's going to
be different from AutoNUMA in Linux since, for us, a migration is way more
expensive than it is for them.

> Which I guess right now is done best
> via the balloon driver - so what you really want is a callback
> to tell the balloon driver: hey, balloon down and up this
> PFN block on NUMA node X.

I'm currently doing it via some sort of "lightweight suspend/resume cycle".
I like the idea of trying to exploit the ballooning driver for that, but
that will probably happen in a subsequent step (I want it working that way
before starting to think about how to improve it! :-P).

Anyway, the above is just to say that this is also not this series' job and,
although there surely are contact points, I think these things can be
considered (and hence worked on/implemented) pretty independently.

> Perhaps what could be done is to set up in the cluster of hosts
> the worst case NUMA topology and force it on all the guests.
> Then when migrating the "pools" can be filled/unfilled
> depending on which host the guest is - and whether it can
> fill up the NUMA pools properly. For example it migrates
> from a 1 node box to a 16 node box and all the memory
> is remote. It will empty out the PV NUMA pool of the "closest"
> memory to zero and fill up the PV NUMA pool of the "farthest"
> with all memory to balance it out and have some real
> sense of the PV to machine host memory.

That is also nice... Perhaps this is meat for some sort of high (for sure
higher than xl/libxl) level management/orchestration layer, isn't it?

Anyway... I hope I helped clarify things a bit.
Thanks for having a look and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
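[Editorial note: the balloon-down/up invariant Dario spells out above can be
captured in a small sketch. The data model (a pfn-to-vnode map and a
vnode-to-pnode map) and all names are purely hypothetical, for illustration
only.]

```python
def balloon_cycle(pfn, vnode_of, vnode_to_pnode, allocate_on):
    """Balloon pfn down, then back up, preserving its NUMA placement.

    The two halves of the invariant: the page keeps its virtual node,
    and it is re-populated from the hardware node backing that vnode.
    """
    vnid = vnode_of[pfn]             # 1) still belongs to this vnode
    pnid = vnode_to_pnode[vnid]      # 2) must be backed by this pnode
    mfn = allocate_on(pnid)          # re-populate from the same pnode
    return vnid, pnid, mfn

# pfn 0x1000 is on virtual node 2, which is backed by hardware node 0.
vnid, pnid, _ = balloon_cycle(0x1000, {0x1000: 2}, {2: 0},
                              lambda p: ("mfn-on-node", p))
# vnid == 2 and pnid == 0: both halves of the invariant hold.
```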
Dario Faggioli
2013-Aug-25 21:24 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
On mar, 2013-08-20 at 23:15 +0800, Li Yechen wrote:
> Hi David,
> On Mon, Aug 19, 2013 at 8:58 PM, David Vrabel <david.vrabel@citrix.com> wrote:
>> The physical NUMA topology must not be exposed to guests that have a [...]
>
> Most of you share the opinion that the interface should be in Xen, not in
> the guest balloon driver. I agree with it. In the next version I'll think
> about how to implement this interface between Xen and the balloon driver.

Perfect. That is not that much different from what you already have; the
only bit that will need some rework is the ballooning up path (see George's
e-mails).

>> In general, I'm not keen on adding ABIs or interfaces that don't solve
>> real world problems, particularly if they're easy to misuse and end up
>> with something that is very suboptimal.
>
> Dario, could the test examples that you sent to me several months ago be
> presented as a real-world example?
> The example shows that, after several guests are created and shut down,
> the node affinity is a mess

They were not real-world examples. As I said before, this is a
chicken-&-egg problem: there are no real world examples until we implement
the feature! :-P

What I think you're talking about is an old (2010?) Xen Summit presentation
from someone working on the same problem before, but then not finishing it.
I don't have the link handy right now... I'll see if I can find it and post
it here.

> Oh, I think this is a better interface!
> I'd appreciate this much more than what I have now. However, the code I
> show here _does_not_really_work_. It just passes some small tests, and
> I'm afraid that there may be some bugs in my code.
> Could I set up David's interface as a secondary goal, waiting until this
> code is fully tested? I'm really not that confident :)

EhEh... All code has bugs. :-)

As I said, I also like this interface more.
However, what I think you should concentrate on (apart from, of course,
debugging) is producing a version of the series which does not use any new
xenstore keys/interface at all, and just balloons up and down taking pages
evenly from all the guest's virtual NUMA nodes. After that, we can come back
and implement more fine grained control, probably via this interface David
is proposing here.

What do you think?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
Li Yechen
2013-Sep-26 14:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Hi all,

Sorry for being absent for a month.... And congratulations to Elena for your
patches! I'll read them when I'm free~~

In conclusion, there are two things for the NUMA support bubble:

First,

> If we decide we do need such control, I think the xenstore interface
> should look more like:
>
> memory/target
>
>   as before
>
> memory/target-by-nid/0
>
>   target for virtual node 0
>
> ...
>
> memory/target-by-nid/N
>
>   target for virtual node N
>
> I think this better reflects the goal which is an adjusted NUMA layout
> for the guest rather than the steps required to reach it (release P
> pages from node N).

Yes, I think this is a very good idea. However, if target conflicts with the
sum of target-by-nid/xxx, the balloon driver may be confused. My idea is
something like:

1. The user can see the target tree, for example:
   memory/target
   memory/target-by-nid/0
   memory/target-by-nid/1
   memory/target-by-nid/2
2. The user can use the xl tool to set one of them to a specified value.
   For example: the user sets memory/target-by-nid/0 from 100M to 200M.
   That means: increase both memory/target and memory/target-by-nid/0 by
   100M.
   Another example: the user sets memory/target from 800M to 900M. That
   means: increase memory/target by 100M, but the balloon driver can make
   the per-node decision on its own.

   Does that make sense?

3. In domU, the balloon driver is notified of which key changed, and then
   balloons in/out pages from the corresponding node(s).

Second:

> I think exposing any NUMA topology to guest - regardless of whether it is
> based on real NUMA or not, is OK - and actually a pretty neat thing.
>
> Meaning you could tell a PV guest that it is running on a 16 socket NUMA
> box while in reality it is running on a single socket box. Or vice-versa.
> It can serve as a way to increase performance (or decrease) - and also
> do resource capping (This PV guest will only get 1G of real fast
> memory and then 7GB of slow memory) and let the OS handle the details
> of it (which it does nowadays).
> The mapping though - of which PV pages should belong to which fake
> PV NUMA node - and how they bind to the real NUMA topology - that part
> I am not sure how to solve. More on this later.

Here is the question: should the interfaces
machine_node_id_to_virtual_node_id (and also the reverse) be implemented
inside the kernel, or in Xen as a hypercall?

Elena, I haven't had time to look at your great patches, so I have no idea
whether you have implemented them or not.... If you have, I'd say sorry that
we still need to talk about it : - )

I think implementing them as hypercalls in Xen is very nice, since domU
shouldn't know the hypervisor's NUMA architecture.... But that also means I
have to change three hypercalls about memory operations.....

On the other hand, implementing them in the kernel is pretty neat, but it
goes against the rule of isolation....

Anyway, I prefer the former. Could you guys give me some ideas?

Thank you again for your suggestions on this topic : )

--

Yechen Li

Team of System Virtualization and Cloud Computing
School of Electronic Engineering and Computer Science,
Peking University, China

Nothing is impossible because impossible itself says: " I'm possible "
lccycc From PKU
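[Editorial note: the update semantics proposed in point 2 above could look
roughly like the sketch below - a per-node write also bumps the global
target by the same delta, while a global write leaves the per-node split to
the balloon driver. This is a sketch of the assumed rules, not an actual
xl/xenstore implementation, and all names are made up.]

```python
def set_node_target(targets, nid, new_value):
    """Per-node write: adjust the global target by the same delta."""
    delta = new_value - targets["by_nid"][nid]
    targets["by_nid"][nid] = new_value
    targets["target"] += delta

def set_global_target(targets, new_value):
    """Global write: the per-node split is left to the balloon driver."""
    targets["target"] = new_value

targets = {"target": 800, "by_nid": {0: 100, 1: 300, 2: 400}}
set_node_target(targets, 0, 200)
# target goes 800 -> 900 and node 0 goes 100 -> 200, matching the
# "100M to 200M" example in the message above.
```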
Li Yechen
2013-Sep-26 14:15 UTC
Re: [RFC v2][PATCH 1/3] docs: design and intended usage for NUMA-aware ballooning
Oh sorry, the "bubble" should be "balloon" On Thu, Sep 26, 2013 at 10:15 PM, Li Yechen <lccycc123@gmail.com> wrote:> Hi all, > Sorry for my being absent for a month.... And congratulate to Elena for > your patches! I''ll read them if I''m free~~ > > In conclusion, there are two things for the NUMA support bubble: > > First, > >If we decide we do need such control, I think the xenstore interface > >should look more like: > > > >memory/target > > > > as before > > > >memory/target-by-nid/0 > > > > target for virtual node 0 > > > >... > > > >memory/target-by-nid/N > > > > target for virtual node N > > > >I think this better reflects the goal which is an adjusted NUMA layout > >for the guest rather than the steps required to reach it (release P > >pages from node N). > > Yes I think this is a very good idea. However, if target is conflict with > the sum of target-by-nid/xxx, bubble may be confused.. > My idea is something as: > 1 User can know the target tree, for example: > memory/target > memory/target-by-nid/0 > memory/target-by-nid/1 > memory/target-by-nid/2 > 2 User can use xl tool to set one of them to an specified value. > For example: user set memory/target-by-nid/0 from 100M to 200M > then that means: increase both memory/target and memory/target-by-nid/0 > by 100M > Another example: user set memory/target from 800M to 900M > then that means: increase memory/target by 100M, but balloon could make > decision by its own. > > Does that make sencse? > > 3 In domU, balloon receive that which directory is changed. then it > balloon in/out pages from the node(s). > > > > Second: > > >I think exposing any NUMA topology to guest - irregardless whether it is > based > > on real NUMA or not, is OK - and actually a pretty neat thing. > > > >Meaning you could tell a PV guest that it is running on a 16 socket NUMA > >box while in reality it is running on a single socket box. Or vice-versa. 
> >It can serve as a way to increase performance (or decrease) - and also > >do resource capping (This PV guest will only get 1G of real fast > >memory and then 7GB of slow memory) and let the OS handle the details > > of it (which it does nowadays). > > > > The mapping thought - of which PV pages should belong to which fake > > PV NUMA node - and how they bind to the real NUMA topology - that part > > I am not sure how to solve. More on this later. > > Here is the question: should the interface: > machine_node_id_to_virtual_node_id (and also the reverse) be inplememted > inside kernel, or in Xen as a hypercall? > Elena, I haven''t have time to look at your great patches so that I have > no idea whether you had implememted it or no.... > If you had, I''d say sorry that we still need to talk about it : - ) > I think to implement them as a hypercall in xen is very nice, since > domU shouldn''t know hypervisior''s NUMA architecture.... > But that also means: I have to change three hypercalls about > memory operation..... > > On the other hand, implement them in Kernel is pretty neat, but it goes > against the rule of isolation.... > > Anyway I prefer the previous one. Could you guys give me some ideas? > > Thank you again for your suggestions on this topic : ) > > -- > > Yechen Li > > Team of System Virtualization and Cloud Computing > School of Electronic Engineering and Computer Science, > Peking University, China > > Nothing is impossible because impossible itself says: " I''m possible " > lccycc From PKU > >-- Yechen Li Team of System Virtualization and Cloud Computing School of Electronic Engineering and Computer Science, Peking University, China Nothing is impossible because impossible itself says: " I''m possible " lccycc From PKU _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel