Dan Magenheimer
2012-Sep-26 21:17 UTC
domain creation vs querying free memory (xend and xl)
I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. I'll look into it further, as well as for xl, but I thought I'd ask first to see if there is a known answer or if this is a known problem:

Suppose that xm/xl create is issued on a large-memory domain (PV or HVM or, future, PVH). It takes awhile for this domain to launch and during at least part of this time, the toolset hasn't yet requested all of the required memory from the hypervisor to complete the launch of the domain... or perhaps the toolset has, but the hypervisor is slow about calling the long sequence of page allocations (e.g. maybe because it is zeroing each page?).

Then it is desired to launch a second large-memory domain. The tools can query Xen to see if there is sufficient RAM and there is, because the first launch has not yet allocated all the RAM assigned to it.

But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.

Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?

Thanks for any comments or discussion!

Dan
Konrad Rzeszutek Wilk
2012-Sep-27 11:26 UTC
Re: domain creation vs querying free memory (xend and xl)
On Wed, Sep 26, 2012 at 02:17:06PM -0700, Dan Magenheimer wrote:
> I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. I'll look into it further, as well as for xl, but I thought I'd ask first to see if there is a known answer or if this is a known problem:
>
> Suppose that xm/xl create is issued on a large-memory domain (PV or HVM or, future, PVH). It takes awhile for this domain to launch and during at least part of this time, the toolset hasn't yet requested all of the required memory from the hypervisor to complete the launch of the domain... or perhaps the toolset has, but the hypervisor is slow about calling the long sequence of page allocations (e.g. maybe because it is zeroing each page?).
>
> Then it is desired to launch a second large-memory domain. The tools can query Xen to see if there is sufficient RAM and there is, because the first launch has not yet allocated all the RAM assigned to it.
>
> But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.
>
> Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?

So, say, "freeze" the amount of free memory. Let's CC the XCP folks.

> Thanks for any comments or discussion!
>
> Dan
George Shuklin
2012-Sep-27 15:24 UTC
Re: domain creation vs querying free memory (xend and xl)
Not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

On 27.09.2012 01:17, Dan Magenheimer wrote:
> I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. [...]
Dan Magenheimer
2012-Sep-27 15:32 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Konrad Rzeszutek Wilk
> Subject: Re: domain creation vs querying free memory (xend and xl)
>
> On Wed, Sep 26, 2012 at 02:17:06PM -0700, Dan Magenheimer wrote:
> > [...]
> > Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?
>
> So, say, "freeze" the amount of free memory. Let's CC the XCP folks.

Hmmm... the problem is the opposite (I think, since I don't have hardware at hand to reproduce it).

Assume a machine has 2TB of physical RAM and a "xm create" is started to launch a 1TB guest called "X". While X is being launched, another thread watches "xm free" and sees that it slowly goes down from 1.995TB. That thread does not know what the eventual "floor" will be. Now a third thread does a "xm create" to launch a second 1TB guest "Y". The "xm create" asks the hypervisor and sees, yep, there is, at this moment, 1.376TB of free memory, so it commences launching the guest. Because the hypervisor and dom0 consume some RAM, both of these "xm create" operations will eventually fail, possibly after several minutes.

Seems like a "xm unreserved" is needed, similar to "xm free" but taking into account the tools' knowledge of what RAM is in the process of being reserved for launching domains, not just the allocation requests the hypervisor has already processed.
Dario Faggioli
2012-Sep-28 16:08 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-09-27 at 19:24 +0400, George Shuklin wrote:
> not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

IIRC, xl has a very coarse-grained locking mechanism in place for domain creation too. As a result of that, you shouldn't be able to create two domains at the same time, which should be enough to prevent the situation described in the original e-mail from occurring.

Looking at acquire_lock() and release_lock() (and at where they are called) in the xl code should clarify whether or not that is enough to actually avoid the race (which I think it is, but I might be wrong :-D).

That being said, there is still room for races, although not wrt domain creation, as, for instance, there isn't any synchronization between creation and ballooning, which both manipulate memory. So maybe thinking about some kind of reservation-based/transactional mechanism at some level might make actual sense.

Unfortunately, I've no idea about how xm works in that respect.

Hope this at least helps clarify the situation a bit. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
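[For illustration only -- this is not xl's actual implementation; the lock-file path and function names are invented -- the kind of coarse-grained serialization Dario describes amounts to every domain-creation invocation holding an advisory lock for the duration of the build:]

```c
/* Sketch of coarse-grained creation serialization via an advisory file lock.
 * Illustrative only: path and names are invented, error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

static int acquire_creation_lock(void)
{
    int fd = open("/var/lock/demo-domain-create.lock", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open lockfile");
        exit(1);
    }
    if (flock(fd, LOCK_EX) < 0) {   /* blocks until any other creator finishes */
        perror("flock");
        exit(1);
    }
    return fd;
}

static void release_creation_lock(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

int main(void)
{
    int fd = acquire_creation_lock();
    /* ... check free memory and build the domain here; no other invocation
     * serialized on this lock can allocate domain memory concurrently ... */
    puts("domain built under the creation lock");
    release_creation_lock(fd);
    return 0;
}
```

[The obvious limitation, discussed later in the thread, is that such a lock serializes only cooperating toolstack invocations; it does nothing about ballooning, paging, sharing, or tmem changing allocations underneath it.]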
Ian Jackson
2012-Sep-28 17:12 UTC
Re: domain creation vs querying free memory (xend and xl)
Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"):
> But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.

This is a real problem. The solution is not easy, and may not make it for 4.3. It would involve a rework of the memory handling code in libxl.

Ian.
Dan Magenheimer
2012-Oct-01 20:03 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: Friday, September 28, 2012 11:12 AM
> To: Dan Magenheimer
> Cc: xen-devel@lists.xen.org; Kurt Hackel; Konrad Wilk
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"):
> > But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.
>
> This is a real problem. The solution is not easy, and may not make it for 4.3. It would involve a rework of the memory handling code in libxl.

[broadening cc to "Xen memory technology people", please forward/add if I missed someone]

Hi Ian --

If you can estimate the difficulty, it would appear you have a specific libxl design in mind? Maybe it would be useful to brainstorm a bit to see if there might be a simpler/different solution?

Bearing in mind that I know almost nothing about xl or the tools layer, and that, as a result, I tend to look for hypervisor solutions, I'm thinking it's not possible to solve this without direct participation of the hypervisor anyway, at least while ensuring the solution will successfully work with any memory technology that involves ballooning with the possibility of overcommit (i.e. tmem, page sharing and host-swapping, manual ballooning, PoD)... EVEN if the toolset is single threaded (i.e. only one domain may be created at a time, such as xapi). [1]

As a result, I've cc'ed other parties involved in memory technologies who can chime in if they think the above statement is incorrect for their technology...

Back to design brainstorming:

The way I am thinking about it, the tools need to be involved to the extent that they would need to communicate to the hypervisor the following facts (probably via a new hypercall):

X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)

X2) The launch of X is complete and I will not be requesting the allocation of any more RAM for it. Please release the reservation, whether or not I've requested a total of N MB.

The calls may be nested or partially ordered, i.e.

X1...Y1...Y2...X2
X1...Y1...X2...Y2

and the hypervisor must be able to deal with this.

Then there would need to be two "versions" of "xm/xl free". We can quibble about which should be the default, but they would be:

- "xl --reserved free" asks the hypervisor how much RAM is available taking into account reservations
- "xl --raw free" asks the hypervisor for the instantaneous amount of RAM unallocated, not counting reservations

When the tools are not launching a domain (that is, there has been a matching X2 for every X1), the results of the above "free" queries are always identical.

So, IanJ, does this match up with the design you were thinking about?

Thanks,
Dan

[1] I think the core culprits are (a) the hypervisor accounts for memory allocation of pages strictly on a first-come-first-served basis and (b) the tools don't have any form of need-this-much-memory "transaction" model
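[To make the transactional semantics concrete, here is a minimal sketch -- purely illustrative, not existing Xen hypercalls or data structures; all names and numbers are invented, and real code would need locking and per-domain tracking -- of the accounting the hypervisor could keep for X1/X2 and the two flavours of "free":]

```c
/* Illustrative sketch only: invented names, not Xen code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t free_mb = 1900000;   /* ~2TB machine minus hypervisor/dom0 overhead */
static uint64_t reserved_mb;         /* promised to in-flight domain builds          */

/* X1: atomically check for enough *unreserved* memory and reserve it. */
static bool reserve_launch(uint64_t need_mb, uint64_t *resv)
{
    if (free_mb - reserved_mb < need_mb)
        return false;                /* fail up front, not minutes later */
    reserved_mb += need_mb;
    *resv = need_mb;
    return true;
}

/* Allocation during the build draws down free memory and the reservation.
 * (The reservation guarantees the memory is there, so no check is needed here.) */
static void allocate(uint64_t mb, uint64_t *resv)
{
    free_mb -= mb;
    uint64_t used = mb < *resv ? mb : *resv;
    *resv -= used;
    reserved_mb -= used;
}

/* X2: launch finished; drop whatever part of the reservation went unused. */
static void release_launch(uint64_t *resv)
{
    reserved_mb -= *resv;
    *resv = 0;
}

int main(void)
{
    uint64_t x = 0, y = 0;
    reserve_launch(1000000, &x);                 /* X1 for a 1TB guest       */
    printf("raw free %llu MB, unreserved free %llu MB\n",
           (unsigned long long)free_mb,
           (unsigned long long)(free_mb - reserved_mb));
    if (!reserve_launch(1000000, &y))            /* Y1 now fails immediately */
        puts("second 1TB launch refused: not enough unreserved memory");
    allocate(1000000, &x);                       /* slow build of X proceeds */
    release_launch(&x);                          /* X2                       */
    return 0;
}
```

[The "xl --reserved free" query in the proposal corresponds to free_mb - reserved_mb here, and "xl --raw free" to free_mb alone.]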
Tim Deegan
2012-Oct-02 09:10 UTC
Re: domain creation vs querying free memory (xend and xl)
At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> Bearing in mind that I know almost nothing about xl or the tools layer, and that, as a result, I tend to look for hypervisor solutions, I'm thinking it's not possible to solve this without direct participation of the hypervisor anyway, at least while ensuring the solution will successfully work with any memory technology that involves ballooning with the possibility of overcommit (i.e. tmem, page sharing and host-swapping, manual ballooning, PoD)... EVEN if the toolset is single threaded (i.e. only one domain may be created at a time, such as xapi). [1]

TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.

Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much. In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM. Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

Cheers,

Tim.
Ian Campbell
2012-Oct-02 09:47 UTC
Re: domain creation vs querying free memory (xend and xl)
On Tue, 2012-10-02 at 10:10 +0100, Tim Deegan wrote:
> At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> > [...]
>
> TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.
>
> Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much. In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM. Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

xl today has a big lock around domain creation, which solves the original issue that Dan describes but still has the issue that you describe.

IIRC Dario was going to be looking at adding something to (one or more of) xen, libxl and xl to allow this to be handled more cleverly as part of the NUMA work in 4.3. I think that the intention was still that there would be a critical section within all of the colluding xl instances where memory was set aside for a particular domain, possibly with hypervisor assistance.

Ian.
Dan Magenheimer
2012-Oct-02 18:17 UTC
Re: domain creation vs querying free memory (xend and xl)
(Rats, thought I sent this out yesterday...)

> From: Dario Faggioli [mailto:raistlin@linux.it]
> Sent: Friday, September 28, 2012 10:08 AM
> To: George Shuklin
> Cc: xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> On Thu, 2012-09-27 at 19:24 +0400, George Shuklin wrote:
> > not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

Oops, sorry, I missed this part of the thread because I wasn't directly cc'ed and am behind on my xen-devel reading...

> IIRC, xl has a very coarse-grained locking mechanism in place for domain creation too. As a result of that, you shouldn't be able to create two domains at the same time, which should be enough to prevent the situation described in the original e-mail from occurring.
>
> Looking at acquire_lock() and release_lock() (and at where they are called) in the xl code should clarify whether or not that is enough to actually avoid the race (which I think it is, but I might be wrong :-D).

This sounds like a pretty serious limitation, especially if it applies to migration as well as creation (or a combination)... I hope it is not a regression from xm to xl. For example, suppose a data center is trying to do a planned downtime for machine X by force-migrating all guests to machine Y. It sounds like xl would serialize this?

> That being said, there is still room for races, although not wrt domain creation, as, for instance, there isn't any synchronization between creation and ballooning, which both manipulate memory. So maybe thinking about some kind of reservation-based/transactional mechanism at some level might make actual sense.

Which is mostly the reason I am interested ;-) though solving the superset of my problem is probably a good thing as well.

Dan
Dan Magenheimer
2012-Oct-02 19:33 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> > [...]
>
> TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.

Is this because PoD never independently increases the size of a domain's allocation? If so, then I agree Xapi has solved the problem, because in all cases the toolstack knows when the amount of memory allocated to a domain is increasing.

However, given that George's 4.3 plan contains:

* Memory: Replace PoD with paging mechanism
  owner: george@citrix
  status: May need review

xapi might want to (re)consider either the above 4.3 feature or see that this problem has been properly fixed prior to 4.3, because I am fairly sure that paging _will_ increase a domain's current allocation without knowledge of the toolstack.

> Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much.

It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.

> In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM.

By balloon-to-fit, do you mean that all RAM is occupied? Tmem handles the "sort out freeing up the memory" entirely in the hypervisor, so the toolstack never knows.

> Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

Not sure I understand this, but it seems feasible.

Dan
Tim Deegan
2012-Oct-02 20:16 UTC
Re: domain creation vs querying free memory (xend and xl)
At 12:33 -0700 on 02 Oct (1349181195), Dan Magenheimer wrote:
> Is this because PoD never independently increases the size of a domain's allocation?

AIUI xapi uses the domains' maximum allocations, centrally controlled, to place an upper bound on the amount of guest memory that can be in use. Within those limits there can be ballooning activity. But TBH I don't know the details.

> It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.

But not past the domains' maximum allowance, right? That's not the case with paging, anyway.

> By balloon-to-fit, do you mean that all RAM is occupied? Tmem handles the "sort out freeing up the memory" entirely in the hypervisor, so the toolstack never knows.

Does tmem replace ballooning/sharing/swapping entirely? I thought they could coexist. Or, if you just mean that tmem owns all otherwise-free memory and will relinquish it on demand, then the same problems occur while the toolstack is moving memory from owned-by-guests to owned-by-tmem.

> > Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.
>
> Not sure I understand this, but it seems feasible.

If you ask for a large VM and a small VM to be started at about the same time, the small VM will always win (since you'll free enough memory for the small VM before you free enough for the big one). If you then ask for another small VM it will win again, and so forth, indefinitely postponing the large VM in the waiting-for-memory state, unless some agent explicitly enforces that VMs be started in order. If you have such an agent you probably don't need a hypervisor interlock as well.

I think it would be better to back up a bit. Maybe you could sketch out how you think [lib]xl ought to be handling ballooning/swapping/sharing/tmem when it's starting VMs. I don't have a strong objection to accounting free memory to particular domains if it turns out to be useful, but as always I prefer not to have things happen in the hypervisor if they could happen in less privileged code.

Tim.
Dan Magenheimer
2012-Oct-02 21:56 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> AIUI xapi uses the domains' maximum allocations, centrally controlled, to place an upper bound on the amount of guest memory that can be in use. Within those limits there can be ballooning activity. But TBH I don't know the details.

Yes, that's the same as saying there is no memory-overcommit. The original problem occurs only if there are multiple threads of execution that can be simultaneously asking the hypervisor to allocate memory without the knowledge of a single centralized "controller".

> > It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.
>
> But not past the domains' maximum allowance, right? That's not the case with paging, anyway.

Right. We can quibble about memory hot-add, depending on its design.

> Does tmem replace ballooning/sharing/swapping entirely? I thought they could coexist. Or, if you just mean that tmem owns all otherwise-free memory and will relinquish it on demand, then the same problems occur while the toolstack is moving memory from owned-by-guests to owned-by-tmem.

Tmem replaces sharing/swapping entirely for guests that support it. Since kernel changes are required to support it, not all guests will ever support it. Now, with full tmem support in the Linux kernel, it is possible that eventually all non-legacy Linux guests will support it. Tmem dynamically handles all the transfer of owned-by memory capacity in the hypervisor, essentially augmenting the page allocator, so the hypervisor is the "controller".

Oh, and tmem doesn't replace ballooning at all... it works best with selfballooning (which is also now in the Linux kernel). Ballooning is still a useful mechanism for moving memory capacity between the guest and the hypervisor; tmem caches data and handles policy.

> If you ask for a large VM and a small VM to be started at about the same time, the small VM will always win (since you'll free enough memory for the small VM before you free enough for the big one). If you then ask for another small VM it will win again, and so forth, indefinitely postponing the large VM in the waiting-for-memory state, unless some agent explicitly enforces that VMs be started in order. If you have such an agent you probably don't need a hypervisor interlock as well.

OK, I see, thanks.

> I think it would be better to back up a bit. Maybe you could sketch out how you think [lib]xl ought to be handling ballooning/swapping/sharing/tmem when it's starting VMs. I don't have a strong objection to accounting free memory to particular domains if it turns out to be useful, but as always I prefer not to have things happen in the hypervisor if they could happen in less privileged code.

I sketched it out earlier in this thread, and will attach it again below. I agree with your last statement in general, but would modify it to "if they could happen efficiently and effectively in less privileged code". Obviously everything that Xen does can be done in less privileged code... in an emulator. Emulators just don't do it fast enough.

Tmem argues that doing "memory capacity transfers" at a page granularity can only be done efficiently in the hypervisor. This is true for page-sharing when it breaks a "share" also... it can't go ask the toolstack to approve allocation of a new page every time a write to a shared page occurs.

Does that make sense?

So the original problem must be solved if:
1) Domain creation is not serialized
2) Any domain's current memory allocation can be increased without approval of the toolstack.

Problem (1) arose independently and my interest is that it gets solved in a way that (2) can also benefit.

Dan

(rough proposed design re-attached below)

> From: Dan Magenheimer
> Sent: Monday, October 01, 2012 2:04 PM
> :
> :
> Back to design brainstorming:
>
> The way I am thinking about it, the tools need to be involved to the extent that they would need to communicate to the hypervisor the following facts (probably via a new hypercall):
>
> X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)
> X2) The launch of X is complete and I will not be requesting the allocation of any more RAM for it. Please release the reservation, whether or not I've requested a total of N MB.
>
> The calls may be nested or partially ordered, i.e.
> X1...Y1...Y2...X2
> X1...Y1...X2...Y2
> and the hypervisor must be able to deal with this.
>
> Then there would need to be two "versions" of "xm/xl free". We can quibble about which should be the default, but they would be:
>
> - "xl --reserved free" asks the hypervisor how much RAM is available taking into account reservations
> - "xl --raw free" asks the hypervisor for the instantaneous amount of RAM unallocated, not counting reservations
>
> When the tools are not launching a domain (that is, there has been a matching X2 for every X1), the results of the above "free" queries are always identical.
>
> So, IanJ, does this match up with the design you were thinking about?
>
> Thanks,
> Dan
>
> [1] I think the core culprits are (a) the hypervisor accounts for memory allocation of pages strictly on a first-come-first-served basis and (b) the tools don't have any form of need-this-much-memory "transaction" model
Tim Deegan
2012-Oct-04 10:06 UTC
Re: domain creation vs querying free memory (xend and xl)
At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote:
> Yes, that's the same as saying there is no memory-overcommit.

I'd say there is - but it's all done by ballooning, and it's centrally enforced by lowering each domain's maxmem to its balloon target, so a badly behaved guest can't balloon up and confuse things.

> The original problem occurs only if there are multiple threads of execution that can be simultaneously asking the hypervisor to allocate memory without the knowledge of a single centralized "controller".

Absolutely.

> Tmem argues that doing "memory capacity transfers" at a page granularity can only be done efficiently in the hypervisor. This is true for page-sharing when it breaks a "share" also... it can't go ask the toolstack to approve allocation of a new page every time a write to a shared page occurs.
>
> Does that make sense?

Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

> (rough proposed design re-attached below)

Thanks for that. It describes a sensible-looking hypervisor interface, but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
- decide whether a VM can be started at all;
- control those four systems to shuffle memory around; and
- resolve races sensibly to avoid small VMs deferring large ones.
(AIUI, xl already has some logic to handle the case of balloon-to-fit.)

The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

Cheers,

Tim.
Ian Campbell
2012-Oct-04 10:17 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote:
> but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
> - decide whether a VM can be started at all;
> - control those four systems to shuffle memory around; and
> - resolve races sensibly to avoid small VMs deferring large ones.
> (AIUI, xl already has some logic to handle the case of balloon-to-fit.)
>
> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html

I'm sure there was earlier discussion which led to that, but I can't seem to see it in the archives right now, perhaps I'm not looking for the right Subject.

Olaf might have been intending to look into this (I can't quite remember where we left it)

Ian.
Andres Lagar-Cavilla
2012-Oct-04 13:20 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 6:17 AM, Ian Campbell wrote:
> On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote:
>> but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
>> - decide whether a VM can be started at all;
>> - control those four systems to shuffle memory around; and

Are we talking about a per-VM control, with one or more of those sub-systems colluding concurrently? Or are we talking about a global view, and how chunks of host memory get sub-allocated? Hopefully the latter...

>> - resolve races sensibly to avoid small VMs deferring large ones.
>> (AIUI, xl already has some logic to handle the case of balloon-to-fit.)
>>
>> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.
>
> There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
> http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html
>
> I'm sure there was earlier discussion which led to that, but I can't seem to see it in the archives right now, perhaps I'm not looking for the right Subject.

IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl, and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

Andres
Ian Campbell
2012-Oct-04 13:25 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 14:20 +0100, Andres Lagar-Cavilla wrote:
> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl, and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

That's right, I'd forgotten about that conversation. Yet somehow the mail I referenced seems to be a result of that conversation -- which is a nice coincidence ;-)
Andres Lagar-Cavilla
2012-Oct-04 13:33 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote:
> Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM; I don't see how it would altogether avoid it.

If the objective is trying to put a cap on the unpredictable growth of memory allocations via CoW unsharing, two observations: (1) it will never grow past the nominal VM footprint; (2) one can put a cap today by tweaking d->max_pages -- CoW will fail, the faulting vcpu will sleep, and things can be kicked back into action at a later point.

>>> From: Dan Magenheimer
>>> Sent: Monday, October 01, 2012 2:04 PM
>>> [...]
>>> X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)

X1 does not need hypervisor support. We already coexist with a global daemon that is a single point of failure. I'm not arguing for xenstore to hold onto these reservations, but a daemon can. Xapi does it that way.

Andres
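[Picking up Andres's point that X1 could live outside the hypervisor: a minimal sketch -- invented names, not xapi or libxl code, with the host's free-memory figure hard-coded where a real daemon would query the hypervisor (e.g. via xc_physinfo()) -- of a per-host daemon tracking in-flight "claims" and answering the "unreserved free" question from earlier in the thread:]

```c
/* Sketch of the toolstack-daemon variant: reservations tracked in dom0
 * userspace rather than in Xen. Names are invented; this is not xapi code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 64

struct pending { uint32_t domid; uint64_t mb; };

static struct pending pending[MAX_PENDING];
static int npending;

/* Placeholder: a real daemon would ask the hypervisor for this number. */
static uint64_t hypervisor_free_mb(void) { return 1900000; }

static uint64_t pending_mb(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < npending; i++)
        sum += pending[i].mb;
    return sum;
}

/* The "xm unreserved" / "xl --reserved free" number from earlier in the thread. */
static uint64_t unreserved_mb(void)
{
    return hypervisor_free_mb() - pending_mb();
}

/* X1, daemon-side: admit the launch only if unreserved memory suffices. */
static bool claim(uint32_t domid, uint64_t mb)
{
    if (npending == MAX_PENDING || unreserved_mb() < mb)
        return false;
    pending[npending++] = (struct pending){ domid, mb };
    return true;
}

/* X2, daemon-side: the launch finished (or failed); drop the claim. */
static void unclaim(uint32_t domid)
{
    for (int i = 0; i < npending; i++)
        if (pending[i].domid == domid) {
            pending[i] = pending[--npending];
            return;
        }
}

int main(void)
{
    printf("claim dom 1 (1TB): %s\n", claim(1, 1000000) ? "ok" : "refused");
    printf("claim dom 2 (1TB): %s\n", claim(2, 1000000) ? "ok" : "refused");
    unclaim(1);
    return 0;
}
```

[As Dan notes elsewhere in the thread, a purely toolstack-side scheme like this only covers allocations the toolstack itself initiates; growth driven from within the hypervisor (tmem, paging, CoW unsharing) would bypass it.]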
Dan Magenheimer
2012-Oct-04 16:36 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, October 04, 2012 4:07 AM
> To: Dan Magenheimer
> Cc: Olaf Hering; Keir Fraser; Konrad Wilk; George Dunlap; Kurt Hackel; Ian Jackson; xen-devel@lists.xen.org; George Shuklin; Dario Faggioli; Andres Lagar-Cavilla
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)

Hi Tim -- Good discussion!

> I'd say there is - but it's all done by ballooning, and it's centrally enforced by lowering each domain's maxmem to its balloon target, so a badly behaved guest can't balloon up and confuse things.

While I agree this conceivably is a form of memory overcommit, I discarded it as a workable overcommit solution in 2008. The short reason is: EVERY guest is badly behaved, in that they all want to suck up as much memory as possible and they all need it _now_. This observation is actually what led to tmem.

> Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

This is really just overcommit-by-undercommit. IMHO, any attempt to set aside a chunk of memory for a specific purpose just increases memory pressure on all the other memory users. Nobody has any clue a priori what the size of that dedicated memory pool should be; if it is too big, you are simply wasting memory, and if it is too small, you haven't solved the real problem. Workloads vary too dramatically, instantaneously, and unpredictably across time in their need for memory. Sharing makes it even more complex.

> Thanks for that. It describes a sensible-looking hypervisor interface, but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
> - decide whether a VM can be started at all;
> - control those four systems to shuffle memory around; and
> - resolve races sensibly to avoid small VMs deferring large ones.
>
> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

I agree the second one is interesting, but the only real solution is for the controller to be an oracle for all the guests. That makes it less interesting to me, so balloon-to-fit is less interesting to me (even if it is the only overcommit option for legacy guests). IMHO, the problem is the same as for guest OS's that compute pi in the kernel when there are no runnable tasks, i.e. a virtualization environment is sometimes forced to partition resources, not virtualize those guests. IOW, don't overcommit "unenlightened" legacy guests. [1]

So I don't think the design I wrote up solves the second one, nor do I think it makes it any worse. The design I wrote up is intended to solve the first and third. I _think_ the reservation-transaction model described (X1 and X2) should work for libxl, in the presence of ballooning, sharing, paging, and tmem. And it neither helps nor hurts balloon-to-fit.

Given that, can you shoot holes in the design? Or are there parts that aren't clear? Or (admitting that I am a libxl idiot) is it unworkable for xl/libxl?

Thanks,
Dan

[1] By "unenlightened" here, I mean guests that are still under the notion that they "own" all of a fixed amount of RAM. A balloon driver makes them "semi-enlightened" :-)
Dan Magenheimer
2012-Oct-04 16:54 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> > There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
> > http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html
>
> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl,

Scanning through the archived message I am under the impression that the focus is on a single server... i.e. "punt if actor is not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid stepping on other memory overcommit technologies. That makes it almost orthogonal, I think, to the problem I originally raised.

But a bigger concern is that its focus on a single machine ignores the "cloud", where Xen seems to hold an advantage. In the cloud, the actor is "controlling" _many_ machines. In the problem I originally raised, this actor (a centralized management console) is simply looking for a server that has sufficient memory to house a new domain, and it (or the automation/sysadmin running it) gets unhappy if (xl running on) the server says "yes, there is enough memory" but then later says "oops, I guess there wasn't enough after all".

> and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

Am I detecting s[k|c]epticism?

If so, I too am s[k|c]eptical.

Dan
Dan Magenheimer
2012-Oct-04 16:59 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: > > > At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: > >> Tmem argues that doing "memory capacity transfers" at a page granularity > >> can only be done efficiently in the hypervisor. This is true for > >> page-sharing when it breaks a "share" also... it can''t go ask the > >> toolstack to approve allocation of a new page every time a write to a shared > >> page occurs. > >> > >> Does that make sense? > > > > Yes. The page-sharing version can be handled by having a pool of > > dedicated memory for breaking shares, and the toolstack asynchronously > > replenish that, rather than allowing CoW to use up all memory in the > > system. > > That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I don''t > see how it would altogether avoid it.Agreed, so it doesn''t really solve the problem. (See longer reply to Tim.)> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW > unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap today > by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back > into action at a later point.But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? Any significant memory pressure due to overcommit leads to double-swapping, which leads to horrible performance?
Andres Lagar-Cavilla
2012-Oct-04 17:00 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 12:54 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 6:17 AM, Ian Campbell wrote: >> >>> On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote: >>>> but my question was really: what should xl do, in the presence of >>>> ballooning, sharing, paging and tmem, to >>>> - decide whether a VM can be started at all; >>>> - control those four systems to shuffle memory around; and >> >> Are we talking about a per-VM control, with one or more of those sub-systems colluding concurrently? >> Or are we talking about a global view, and how chunks of host memory get sub-allocated? Hopefully the >> latter... >> >>>> - resolve races sensibly to avoid small VMs deferring large ones. >>>> (AIUI, xl already has some logic to handle the case of balloon-to-fit.) >>>> >>>> The second of those three is the interesting one. It seems to me that >>>> if the tools can''t force all other actors to give up memory (and not >>>> immediately take it back) then they can''t guarantee to be able to start >>>> a new VM, even with the new reservation hypercalls. >>> >>> There was a bit of discussion in the spring about this sort of thing >>> (well, three of the four), which seems to have fallen a bit by the >>> wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem) e.g. >>> http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html >>> >>> I''m sure there was earlier discussion which led to that, but I can''t >>> seem to see it in the archives right now, perhaps I''m not looking for >>> the right Subject. >> >> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a >> scheme so that libxl can be told who the "actor" will be for memory management, and then hand-off >> appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by > > Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies. That makes it > almost orthogonal, I think, to the problem I originally raised.Yeah, fairly orthogonal.> > But a bigger concern is that its focus on a single machine ignores > the "cloud", where Xen seems to hold an advantage. In the cloud, > the actor is "controlling" _many_ machines. In the problem I > originally raised, this actor (a centralized management console) > is simply looking for a server that has sufficient memory to house > a new domain, and it (or the automation/sysadmin running it) gets > unhappy if (xl running on) the server says "yes there is enough > memory" but then later says, "oops, I guess there wasn''t enough > after all".Big problem in itself, but not one for xen.org (yet, cart before horse). Have you had a look at the Openstack FilterScheduler? Plenty of room for contribution.> >> libxl, and the end result is the ability to start domains with a memory target suitably managed by >> balloon, xenpaging, tmem, foo, according to the user''s wish. With no need to know obscure knobs. To >> the extent that might be possible. > > Am I detecting s[k|c]epticism? > > If so, I too am s[k|c]eptical.Well, not really. Things have to coexist cleanly, to the extent feasible. 
Devising a libxl protocol to perform a clean hand-off if required, and to expose minimum complexity to the average Joe, is a great idea IMHO. Andres> > Dan
Andres Lagar-Cavilla
2012-Oct-04 17:08 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: >> >>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: >>>> Tmem argues that doing "memory capacity transfers" at a page granularity >>>> can only be done efficiently in the hypervisor. This is true for >>>> page-sharing when it breaks a "share" also... it can''t go ask the >>>> toolstack to approve allocation of a new page every time a write to a shared >>>> page occurs. >>>> >>>> Does that make sense? >>> >>> Yes. The page-sharing version can be handled by having a pool of >>> dedicated memory for breaking shares, and the toolstack asynchronously >>> replenish that, rather than allowing CoW to use up all memory in the >>> system. >> >> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I don''t >> see how it would altogether avoid it. > > Agreed, so it doesn''t really solve the problem. (See longer reply > to Tim.) > >> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW >> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap today >> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back >> into action at a later point. > > But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? > Any significant memory pressure due to overcommit leads to double-swapping, > which leads to horrible performance?The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion either. (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. It''s really up to a higher level controller whether you accept allocations and later reach a point of thrashing. I understand this is partly where your discussion is headed, but certainly fixing the primary issue of nominal vanilla allocations preempting each other looks fairly critical to begin with. Andres
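For reference, the capping knob being described here can already be driven from the toolstack through libxc. A minimal sketch follows (error handling omitted; the policy of when to lower and raise the cap is entirely up to the caller):

    #include <xenctrl.h>

    /* Sketch: temporarily clamp d->max_pages so that CoW unsharing (or any
     * other allocation) cannot grow the domain, then lift the clamp once
     * corrective measures have been taken. */
    void clamp_domain(uint32_t domid, uint64_t low_kb, uint64_t normal_kb)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);

        /* Lower the cap: allocations beyond it will fail, so a CoW unshare
         * would leave the faulting vcpu asleep until the cap is raised. */
        xc_domain_setmaxmem(xch, domid, low_kb);

        /* ... take corrective measures: page out, balloon something, ... */

        /* Raise the cap again and let the domain make progress. */
        xc_domain_setmaxmem(xch, domid, normal_kb);

        xc_interface_close(xch);
    }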
Dan Magenheimer
2012-Oct-04 17:18 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: > > >> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > >> > >> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: > >> > >>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: > >>>> Tmem argues that doing "memory capacity transfers" at a page granularity > >>>> can only be done efficiently in the hypervisor. This is true for > >>>> page-sharing when it breaks a "share" also... it can''t go ask the > >>>> toolstack to approve allocation of a new page every time a write to a shared > >>>> page occurs. > >>>> > >>>> Does that make sense? > >>> > >>> Yes. The page-sharing version can be handled by having a pool of > >>> dedicated memory for breaking shares, and the toolstack asynchronously > >>> replenish that, rather than allowing CoW to use up all memory in the > >>> system. > >> > >> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I > don''t > >> see how it would altogether avoid it. > > > > Agreed, so it doesn''t really solve the problem. (See longer reply > > to Tim.) > > > >> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW > >> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap > today > >> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back > >> into action at a later point. > > > > But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? > > Any significant memory pressure due to overcommit leads to double-swapping, > > which leads to horrible performance? > > The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed > scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. > > I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion > either.Sorry, I wasn''t intending that to be snarky, but on re-read I guess it did sound snarky. What I meant is: Is this just a manual version of what VMware does automatically? Or is there something I am misunderstanding? (I think you answered that below.)> (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised > again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. > It''s really up to a higher level controller whether you accept allocations and later reach a point of > thrashing. > > I understand this is partly where your discussion is headed, but certainly fixing the primary issue of > nominal vanilla allocations preempting each other looks fairly critical to begin with.OK. I _think_ the design I proposed helps in systems that are using page-sharing/host-swapping as well... I assume share-breaking just calls the normal hypervisor allocator interface to allocate a new page (if available)? If you could review and comment on the design from a page-sharing/host-swapping perspective, I would appreciate it. Thanks, Dan
Andres Lagar-Cavilla
2012-Oct-04 17:30 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: >> >>>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >>>> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >>>> >>>> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: >>>> >>>>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: >>>>>> Tmem argues that doing "memory capacity transfers" at a page granularity >>>>>> can only be done efficiently in the hypervisor. This is true for >>>>>> page-sharing when it breaks a "share" also... it can''t go ask the >>>>>> toolstack to approve allocation of a new page every time a write to a shared >>>>>> page occurs. >>>>>> >>>>>> Does that make sense? >>>>> >>>>> Yes. The page-sharing version can be handled by having a pool of >>>>> dedicated memory for breaking shares, and the toolstack asynchronously >>>>> replenish that, rather than allowing CoW to use up all memory in the >>>>> system. >>>> >>>> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I >> don''t >>>> see how it would altogether avoid it. >>> >>> Agreed, so it doesn''t really solve the problem. (See longer reply >>> to Tim.) >>> >>>> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW >>>> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap >> today >>>> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back >>>> into action at a later point. >>> >>> But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? >>> Any significant memory pressure due to overcommit leads to double-swapping, >>> which leads to horrible performance? >> >> The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed >> scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. >> >> I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion >> either. > > Sorry, I wasn''t intending that to be snarky, but on re-read I guess it > did sound snarky. What I meant is: Is this just a manual version of what > VMware does automatically? Or is there something I am misunderstanding? > (I think you answered that below.) > >> (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised >> again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. >> It''s really up to a higher level controller whether you accept allocations and later reach a point of >> thrashing. >> >> I understand this is partly where your discussion is headed, but certainly fixing the primary issue of >> nominal vanilla allocations preempting each other looks fairly critical to begin with. > > OK. I _think_ the design I proposed helps in systems that are using > page-sharing/host-swapping as well... I assume share-breaking just > calls the normal hypervisor allocator interface to allocate a > new page (if available)? If you could review and comment on > the design from a page-sharing/host-swapping perspective, I would > appreciate it.I think you will need to refine your notion of reservation. 
If you have nominal RAM N, and current RAM C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing or whatever -- then you are not over-committing memory.

To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee that.

Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to stick to the watermark -- if something swaps in, something else swaps out. And uncooperative balloons can be stymied by xapi using d->max_pages.

This is why I believe you need to solve the problem of initial reservation, and the problem of handing off to the right actor. And then xl need not care any further.

Andres> > Thanks, > Dan
Dan Magenheimer
2012-Oct-04 17:55 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote: > > >> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > >> > >> > >> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: > >> > > OK. I _think_ the design I proposed helps in systems that are using > > page-sharing/host-swapping as well... I assume share-breaking just > > calls the normal hypervisor allocator interface to allocate a > > new page (if available)? If you could review and comment on > > the design from a page-sharing/host-swapping perspective, I would > > appreciate it. > > I think you will need to refine your notion of reservation. If you have nominal RAM N, and current RAM > C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing > or whatever -- then you are not over-committing memory. > > To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee > that. > > Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to > stick to the watermark -- if something swaps in something else swaps out. And uncooperative balloons > are be stymied by xapi using d->max_pages. > > This is why I believe you need to solve the problem of initial reservation, and the problem of handing > off to the right actor. And then xl need not care any further. > > AndresI think we may be saying the same thing, at least in the context of the issue I am trying to solve (which, admittedly, may be a smaller part of a bigger issue, and we should attempt to ensure that the solution to the smaller part is at least a step in the right direction for the bigger issue). And I am trying to solve the mechanism problem only, not the policy which, I agree is mm-specific. The core problem, as I see it, is that there are multiple consumers of memory, some of which may be visible to xl and some of which are not. Ultimately, the hypervisor is asked to provide memory and will return failure if it can''t, so the hypervisor is the final arbiter. When a domain is created, we''d like to ensure there is enough memory for it to "not fail". But when the toolstack asks for memory to create a domain, it asks for it "piecemeal". I''ll assume that the toolstack knows how much memory it needs to allocate to ensure the launch doesn''t fail... my solution is that it asks for that entire amount of memory at once as a "reservation". If the hypervisor has that much memory available, it returns success and must behave as if the memory has been already allocated. Then, later, when the toolstack is happy that the domain did successfully launch, it says "remember that reservation? any memory reserved that has not yet been allocated, need no longer be reserved, you can unreserve it" In other words, between reservation and unreserve, there is no memory overcommit for that domain. Once the toolstack does the unreserve, its memory is available for overcommit mechanisms. Not sure if that part was clear: it''s my intent that unreserve occur soon after the domain is launched, _not_, for example, when the domain is shut down. What I don''t know is if there is a suitable point in the launch when the toolstack knows it can do the "release"... that may be the sticking point and may be mm-specific. Thanks, Dan
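As a sketch of the intended ordering from the toolstack side (the reserve/unreserve calls are the hypothetical ones sketched earlier in the thread, and build_and_launch() stands in for the existing piecemeal build path -- none of this is current libxl code):

    #include <xenctrl.h>

    /* Hypothetical wrappers, not real libxc calls. */
    int xc_domain_reserve_memory(xc_interface *xch, uint32_t domid, uint64_t max_kb);
    int xc_domain_unreserve_memory(xc_interface *xch, uint32_t domid);
    /* Placeholder for the existing domain build path. */
    int build_and_launch(xc_interface *xch, uint32_t domid);

    int create_with_reservation(xc_interface *xch, uint32_t domid,
                                uint64_t launch_kb)
    {
        int rc;

        /* Reserve the whole launch footprint up front, transactionally.
         * If this fails we fail fast, before building anything. */
        rc = xc_domain_reserve_memory(xch, domid, launch_kb);
        if (rc)
            return rc;

        /* The usual piecemeal allocations now draw down the reservation
         * instead of racing against other consumers of free memory. */
        rc = build_and_launch(xch, domid);

        /* Unreserve soon after launch, success or failure: anything reserved
         * but never allocated becomes available again, including to the
         * overcommit mechanisms. */
        xc_domain_unreserve_memory(xch, domid);

        return rc;
    }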
Olaf Hering
2012-Oct-04 18:26 UTC
Re: domain creation vs querying free memory (xend and xl)
On Mon, Oct 01, Dan Magenheimer wrote:> > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > > Sent: Friday, September 28, 2012 11:12 AM > > To: Dan Magenheimer > > Cc: xen-devel@lists.xen.org; Kurt Hackel; Konrad Wilk > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"): > > > But the second domain launch fails, possibly after > > > several minutes because, actually, there isn''t enough > > > physical RAM for both. > > > > This is a real problem. The solution is not easy, and may not make it > > for 4.3. It would involve a rework of the memory handling code in > > libxl. > > [broadening cc to "Xen memory technology people", please forward/add > if I missed someone]Dan, I''m sure there has been already alot of thought and discussion about this issue, So here are my thoughts: In my opinion the code which is about to start a domain has to take all currently created/starting/running/dying domains, and their individual "allocation behaviour", into account before it can finally launch the domain. All of this needs math, not locking. A domain (domU or dom0) has a couple of constraints: - current nr_pages vs. target_nr_pages vs. max_pages - current PoD allocation vs. max_PoD - current paged_pages vs. target_nr_pages vs. max_paged_pages - some shared_pages - some tmem - maybe grant_pages - ... Depending on the state (starting and working towards a target number, running, dying) the "current" numbers above will increase or shrink. So the algorithm which turns the parameters above for each domain into a total number of allocated (or soon to be allocated) host memory has to work with "target numbers" instead of what is currently allocated. Some examples that come to mind: - a PoD domain will most likely use all of the pages configured with memory=, so that number should be used - the number shared pages is eventually not predictable. If so, this number could be handled as "allocated to the guest". Maybe a knob to say "running domains will have amount N shared" can exist? Dont know much about how sharing looks in practice. - ballooning may not reach the configured target, and the guest admin can just balloon up to the limit without notifying the toolstack - a new paging target will take some time until its reached, there is always some jitter during page-in/page-out, mapping guest pages will cause nomination failures. - tmem does something, I dont know. - no idea if grant pages are needed in the math Since the central management of xend is gone each libxl process is likely on its own, so two "xl create" can race when doing the math. Maybe a libxl process dies and leaves a mess behind. So that could make it difficult to get a good snapshot of the memory situation on the host. Maybe each domain could get some metadata to record the individual current/target/max numbers. Or if xenstore is good enough, something can cleanup zombie numbers. As IanJ said, the memory handling code in libxl needs such a feature to do the math right. The proposed handling of sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of it. Olaf
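A rough sketch of the kind of math this implies, working from per-domain targets rather than instantaneous allocations (the field names are illustrative, not the real libxl or libxc structures):

    #include <stdint.h>

    struct dom_mem_info {
        uint64_t current_kb;     /* what is allocated right now            */
        uint64_t target_kb;      /* memory= / ballooning / paging target   */
        uint64_t shared_kb_est;  /* allowance for pages that may unshare   */
    };

    /* Count each domain at the larger of its current and target size, plus
     * an allowance for unsharing; a PoD domain will most likely grow to its
     * configured memory=, so its target covers that case as well. */
    uint64_t host_kb_committed(const struct dom_mem_info *doms, int ndoms)
    {
        uint64_t committed = 0;
        for (int i = 0; i < ndoms; i++) {
            uint64_t claim = doms[i].current_kb > doms[i].target_kb
                           ? doms[i].current_kb : doms[i].target_kb;
            committed += claim + doms[i].shared_kb_est;
        }
        return committed;
    }

    /* A new guest "fits" if committed + its own target <= host total --
     * modulo the jitter and races discussed in this thread. */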
Dan Magenheimer
2012-Oct-04 19:38 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Olaf Hering [mailto:olaf@aepfle.de] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Mon, Oct 01, Dan Magenheimer wrote: >Hi Olaf -- Thanks for the reply.> domain. All of this needs math, not locking. > : > As IanJ said, the memory handling code in libxl needs such a feature to > do the math right. The proposed handling of > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > it.Unfortunately, as you observe in some of the cases earlier in your reply, it is more than a math problem for libxl... it is a crystal ball problem. If xl launches a domain D at time T and it takes N seconds before it has completed asking the hypervisor for all of the memory M that D will require to successfully launch, then xl must determine at time T the maximum memory allocated across all running domains for the future time period between T and T+N. In other words, xl must predict the future. Clearly this is impossible especially when page-sharing is not communicating its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem is not communicating allocations resulting from multiple domains simultaenously making tmem hypercalls to libxl, and PoD is not communicating its allocations to libxl, and in-guest-kernel selfballooning is not communicating allocations to libxl. Only the hypervisor is aware of every dynamic allocation request. So all libxl can do is guess about the future because races are going to occur. Multiple threads are simultaneously trying to access a limited resource (pages of memory) and only the hypervisor knows whether there is enough to deliver memory for all requests. To me, the solution to racing for a shared resource is locking. Naturally, you want the critical path to be as short as possible. And you don''t want to lock all instances of the resource (i.e. every page in memory) if you can avoid it. And you need to ensure that the lock is honored for all requests to allocate the shared resource, meaning in this case that it has to be done in the hypervisor. I think that''s what the proposed design does: It provides a mechanism to ask the hypervisor to reserve a fixed amount of memory M, some or all of which will eventually turn into an allocation request; and a mechanism to ask the hypervisor to no longer honor that reservation ("unreserve") whether or not all of M has been allocated. It essentially locks that M amount of memory between reserve and unreserve so that other dynamic allocations (page-sharing, tmem, PoD, OR another libxl thread trying to create another domain) cannot sneak in and claim memory capacity that has been reserved. Does that make sense? Thanks, Dan
Olaf Hering
2012-Oct-04 20:18 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, Oct 04, Dan Magenheimer wrote:> > From: Olaf Hering [mailto:olaf@aepfle.de] > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > On Mon, Oct 01, Dan Magenheimer wrote: > > > > Hi Olaf -- > > Thanks for the reply. > > > domain. All of this needs math, not locking. > > : > > As IanJ said, the memory handling code in libxl needs such a feature to > > do the math right. The proposed handling of > > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > > it. > > Unfortunately, as you observe in some of the cases earlier in your reply, > it is more than a math problem for libxl... it is a crystal ball problem. > If xl launches a domain D at time T and it takes N seconds before it has > completed asking the hypervisor for all of the memory M that D will require > to successfully launch, then xl must determine at time T the maximum memory > allocated across all running domains for the future time period between > T and T+N. In other words, xl must predict the future.I think xl can predict it, if it takes the target of all domains into account. Certainly not down to a handful pages, it would be good enough to know if the calculated estimate of free memory is good for the new guest and its specific memory targets.> Clearly this is impossible especially when page-sharing is not communicating > its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem > is not communicating allocations resulting from multiple domains > simultaenously making tmem hypercalls to libxl, and PoD is not communicating > its allocations to libxl, and in-guest-kernel selfballooning is not communicating > allocations to libxl. Only the hypervisor is aware of every dynamic allocation > request.The hypervisor can not predict the future either, and it has even less info about the individual targets of each domain.> Does that make sense?It does, but: If xl reserves the memory in its own "virtual allocator", or if Xen gets such functionality, does not really matter, as long as its known how much exactly needs to be allocated. I think that part is missing. Olaf
Dan Magenheimer
2012-Oct-04 20:35 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Olaf Hering [mailto:olaf@aepfle.de] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Thu, Oct 04, Dan Magenheimer wrote: > > > > From: Olaf Hering [mailto:olaf@aepfle.de] > > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > > > On Mon, Oct 01, Dan Magenheimer wrote: > > > > > > > Hi Olaf -- > > > > Thanks for the reply. > > > > > domain. All of this needs math, not locking. > > > : > > > As IanJ said, the memory handling code in libxl needs such a feature to > > > do the math right. The proposed handling of > > > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > > > it. > > > > Unfortunately, as you observe in some of the cases earlier in your reply, > > it is more than a math problem for libxl... it is a crystal ball problem. > > If xl launches a domain D at time T and it takes N seconds before it has > > completed asking the hypervisor for all of the memory M that D will require > > to successfully launch, then xl must determine at time T the maximum memory > > allocated across all running domains for the future time period between > > T and T+N. In other words, xl must predict the future. > > I think xl can predict it, if it takes the target of all domains into > account. Certainly not down to a handful pages, it would be good enough > to know if the calculated estimate of free memory is good for the new > guest and its specific memory targets.Well I don''t know enough about the page-sharing implementation but it''s not hard with tmem to synthesize a workload where the amount of free memory is half of RAM at time T and there is no RAM left at all at time T+(N/2) and three quarters of RAM is free at time T+N. That would be very hard for xl to predict. I expect that dramatic changes like this might be harder with page-sharing but not impossible.> > Clearly this is impossible especially when page-sharing is not communicating > > its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem > > is not communicating allocations resulting from multiple domains > > simultaenously making tmem hypercalls to libxl, and PoD is not communicating > > its allocations to libxl, and in-guest-kernel selfballooning is not communicating > > allocations to libxl. Only the hypervisor is aware of every dynamic allocation > > request. > > The hypervisor can not predict the future either, and it has even less > info about the individual targets of each domain.The point is the hypervisor doesn''t need to predict the future and doesn''t need to know the individual targets. It just acts on allocation requests and, with the proposed design, on reservation requests.> > Does that make sense? > > It does, but: > If xl reserves the memory in its own "virtual allocator", or if Xen gets > such functionality, does not really matter, as long as its known how much > exactly needs to be allocated. I think that part is missing.I agree, though I think the only constraint is that the domain must be capable of booting. So if xl always requests a reservation of "mem=", I would think that should always work.
Ian Campbell
2012-Oct-05 09:44 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 17:54 +0100, Dan Magenheimer wrote:> Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies.

xl is inherently a single-system toolstack, and a simple ballooning-based actor would just be its default. The design is not intended to require that a toolstack only provide a single actor, or indeed that the actor is provided by the toolstack at all. It would be perfectly reasonable for xl to provide actors which work well with tmem or paging or sharing or some complex combination, and even to select them by default when those technologies are enabled on the host. We also fully expect that other toolstacks will want to provide their own actors which make use of the facilities of those toolstacks to do a better job based on the additional state, etc. (e.g. we expect xapi to want to provide a squeezed-based actor). Lastly, the design is also intended to support "3rd party" actors which are not part of any toolstack. E.g. actors which talk to your cloud orchestration layer, or to some central authority, or which communicate with other hosts, etc., are all intended to be a possibility.

> That makes it > almost orthogonal, I think, to the problem I originally raised. > > But a bigger concern is that its focus on a single machine ignores > the "cloud", where Xen seems to hold an advantage. In the cloud, > the actor is "controlling" _many_ machines. In the problem I > originally raised, this actor (a centralized management console) > is simply looking for a server that has sufficient memory to house > a new domain, and it (or the automation/sysadmin running it) gets > unhappy if (xl running on) the server says "yes there is enough > memory" but then later says, "oops, I guess there wasn''t enough > after all".

Integrating some sort of "entry control" into the actor protocol seems like a logical addition to me (assuming we didn''t already include it, I didn''t go back and check), since the details of when to say yes or no seem like they would depend very much on the policies of that particular actor and the technologies which it is using to implement them.

Ian.
George Dunlap
2012-Oct-05 11:40 UTC
Re: domain creation vs querying free memory (xend and xl)
On 04/10/12 17:54, Dan Magenheimer wrote:>> > Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies. That makes it > almost orthogonal, I think, to the problem I originally raised.No, the idea was to allow the flexibility of different actors in different situations. The plan was to start with a simple actor, but to add new ones as necessary. But on reflection, it seems like the whole "actor" thing was actually something completely separate to what we''re talking about here. The idea behind the actor (IIRC) was that you could tell the toolstack, "Make VM A use X amount of host memory"; and the actor would determine the best way to do that -- either by only ballooning, or ballooning first and then swapping. But it doesn''t decide how to get the value X. This thread has been very hard to follow for some reason, so let me see if I can understand everything: * You are concerned about being able to predictably start VMs in the face of: - concurrent requests, and - dynamic memory technologies (including PoD, ballooning, paging, page sharing, and tmem) Any of which may change the amount of free memory between the time a decision is made and the time memory is actually allocated. * You have proposed a hypervisor-based solution that allows the toolstack to "reserve" a specific amount of memory to a VM that will not be used for something else; this allocation is transactional -- it will either completely succeed, or completely fail, and do it quickly. Is that correct? The problem with that solution, it seems to me, is that the hypervisor does not (and I think probably should not) have any insight into the policy for allocating or freeing memory as a result of other activities, such as ballooning or page sharing. Suppose someone were ballooning down domain M to get 8GiB in order to start domain A; and at some point , another process looks and says, "Oh look, there''s 4GiB free, that''s enough to start domain B" and asks Xen to reserve that memory. Xen has no way of knowing that the memory freed by domain M was "earmarked" for domain A, and so will happily give it to domain B, causing domain A''s creation to fail (potentially). So it seems like we need to have the idea of a memory controller -- one central process (per host, as you say) that would know about all of the knobs -- ballooning, paging, page sharing, tmem, whatever -- that could be in charge of knowing where all the memory was coming from and where it was going. So if xl wanted to start a new VM, it can ask the memory controller for 3GiB, and the controller could decide, "I''ll take 1GiB from domain M and 2 from domain N, and give it to the new domain", and respond when it has the memory that it needs. Similarly, it can know that it should try to keep X megabytes for un-sharing of pages, and it can be responsible for freeing up more memory if that memory becomes exhausted. At the moment, the administrator himself (or the cloud orchestration layer) needs to be his own memory controller; that is, he needs to manually decide if there''s enough free memory to start a VM; if there''s not, he needs to figure out how to get that memory (either by ballooning or swapping). Ballooning and swapping are both totally under his control; the only thing he doesn''t control is the unsharing of pages. 
But as long as there was a way to tell the page-sharing daemon to leave a certain amount of free memory untouched (i.e. not consume it for un-sharing), then this "administrator-as-memory-controller" should work just fine. Does that make sense? Or am I still confused? :-) -George
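Purely as an illustration of the decision such a controller would be making (nothing like this exists in xl today, and every helper name below is invented for the sketch):

    #include <stdint.h>

    struct dom_view { uint32_t domid; uint64_t current_kb; uint64_t min_kb; };

    /* Invented helpers: how much host memory is currently free, and how to
     * ask a domain to balloon to a new target. */
    uint64_t host_free_kb(void);
    void set_balloon_target(uint32_t domid, uint64_t target_kb);

    /* Illustrative only: free up need_kb of host memory by ballooning down
     * other domains, the way the hypothetical controller above would. */
    int controller_free_up(struct dom_view *doms, int ndoms, uint64_t need_kb)
    {
        uint64_t free_kb = host_free_kb();
        if (free_kb >= need_kb)
            return 0;

        uint64_t still_needed = need_kb - free_kb;
        for (int i = 0; i < ndoms && still_needed > 0; i++) {
            uint64_t spare = doms[i].current_kb - doms[i].min_kb;
            uint64_t take  = spare < still_needed ? spare : still_needed;
            set_balloon_target(doms[i].domid, doms[i].current_kb - take);
            still_needed -= take;
        }

        /* The controller must still wait for the balloon drivers to comply,
         * and must "earmark" the freed memory so that a concurrent request
         * cannot take it first -- which is exactly the race being discussed
         * in this thread. */
        return still_needed == 0 ? 0 : -1;
    }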
Andres Lagar-Cavilla
2012-Oct-05 14:25 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 1:55 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote: >> >>>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >>>> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >>>> >>>> >>>> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: >>>> >>> OK. I _think_ the design I proposed helps in systems that are using >>> page-sharing/host-swapping as well... I assume share-breaking just >>> calls the normal hypervisor allocator interface to allocate a >>> new page (if available)? If you could review and comment on >>> the design from a page-sharing/host-swapping perspective, I would >>> appreciate it. >> >> I think you will need to refine your notion of reservation. If you have nominal RAM N, and current RAM >> C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing >> or whatever -- then you are not over-committing memory. >> >> To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee >> that. >> >> Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to >> stick to the watermark -- if something swaps in something else swaps out. And uncooperative balloons >> are be stymied by xapi using d->max_pages. >> >> This is why I believe you need to solve the problem of initial reservation, and the problem of handing >> off to the right actor. And then xl need not care any further. >> >> Andres > > I think we may be saying the same thing, at least in the context > of the issue I am trying to solve (which, admittedly, may be > a smaller part of a bigger issue, and we should attempt to ensure > that the solution to the smaller part is at least a step in the > right direction for the bigger issue). And I am trying to > solve the mechanism problem only, not the policy which, I agree is > mm-specific. > > The core problem, as I see it, is that there are multiple consumers of > memory, some of which may be visible to xl and some of which are > not. Ultimately, the hypervisor is asked to provide memory > and will return failure if it can''t, so the hypervisor is the > final arbiter. > > When a domain is created, we''d like to ensure there is enough memory > for it to "not fail". But when the toolstack asks for memory to > create a domain, it asks for it "piecemeal". I''ll assume that > the toolstack knows how much memory it needs to allocate to ensure > the launch doesn''t fail... my solution is that it asks for that > entire amount of memory at once as a "reservation". If the > hypervisor has that much memory available, it returns success and > must behave as if the memory has been already allocated. Then, > later, when the toolstack is happy that the domain did successfully > launch, it says "remember that reservation? any memory reserved > that has not yet been allocated, need no longer be reserved, you > can unreserve it" > > In other words, between reservation and unreserve, there is no > memory overcommit for that domain. Once the toolstack does > the unreserve, its memory is available for overcommit mechanisms.I think that will be fragile. Suppose you have a 16 GiB domain and an overcommit mechanism that allows you to start the VM with 8 GiB. Straight-forward scenario with xen-4.2 and a combination of PoD and ballooning. 
Suppose you have 14GiB of RAM free in the system. Why should creation of that domain fail? Andres> > Not sure if that part was clear: it''s my intent that unreserve occur > soon after the domain is launched, _not_, for example, when the domain > is shut down. What I don''t know is if there is a suitable point > in the launch when the toolstack knows it can do the "release"... > that may be the sticking point and may be mm-specific. > > Thanks, > Dan
Dan Magenheimer
2012-Oct-07 23:43 UTC
Re: domain creation vs querying free memory (xend and xl)
> > In other words, between reservation and unreserve, there is no > > memory overcommit for that domain. Once the toolstack does > > the unreserve, its memory is available for overcommit mechanisms. > > I think that will be fragile. Suppose you have a 16 GiB domain and an overcommit mechanism that allows > you to start the VM with 8 GiB. Straight-forward scenario with xen-4.2 and a combination of PoD and > ballooning. Suppose you have 14GiB of RAM free in the system. Why should creation of that domain fail?

It shouldn''t. Either I''m not clear or I don''t understand PoD. My understanding of PoD is that, for the above case, the domain has "mem=8192 maxmem=16384". So with my proposal xl would ask for a reservation of 8192M and, when the domain is successfully launched (i.e. for PoD, balloon driver is running?), make the matching unreserve call.

* Not sure why that would be any more fragile than today. In fact it seems to me it is less fragile... changing your example to "8GiB of RAM free in the system", today, xl will ask if there is enough memory and will be told yes and attempt to launch the domain. But then suppose in between the time xl asks the hypervisor if there is enough free memory and the time it attempts to launch the domain, another domain eats up a few pages and now there is ever so slightly less than 8GiB. Won''t the domain creation commence and then fail a few moments later? (A few moments, probably not a big deal, but multiply the memory sizes by 64 and a few moments becomes a few minutes!) With my proposal, the domain will immediately fail to launch because the reservation will fail.

* Maybe the above "there is no memory overcommit for that domain" was confusing? I suppose you could call that "mem=8192 maxmem=16384" overcommit... I just didn''t think of it that way.
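For concreteness, the scenario above corresponds to a guest config along these lines (standard xl syntax); under the proposal, xl would reserve only the memory= amount, not maxmem=:

    # 16 GiB guest started with 8 GiB populated (PoD plus ballooning)
    memory = 8192     # what the reservation (and the initial allocation) covers
    maxmem = 16384    # what the guest may later grow to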
Dan Magenheimer
2012-Oct-08 01:02 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com] > Sent: Friday, October 05, 2012 5:40 AM > To: Dan Magenheimer > Cc: Andres Lagar-Cavilla; Ian Campbell; Tim (Xen.org); Olaf Hering; Keir (Xen.org); Konrad Wilk; Kurt > Hackel; Ian Jackson; xen-devel@lists.xen.org; George Shuklin; Dario Faggioli > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)Hi George -- Thanks for your thoughts!> On 04/10/12 17:54, Dan Magenheimer wrote: > >> > > Scanning through the archived message I am under the impression > > that the focus is on a single server... i.e. "punt if actor is > > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > > stepping on other memory overcommit technologies. That makes it > > almost orthogonal, I think, to the problem I originally raised. > No, the idea was to allow the flexibility of different actors in > different situations. The plan was to start with a simple actor, but to > add new ones as necessary. But on reflection, it seems like the whole > "actor" thing was actually something completely separate to what we''re > talking about here. The idea behind the actor (IIRC) was that you could > tell the toolstack, "Make VM A use X amount of host memory"; and the > actor would determine the best way to do that -- either by only > ballooning, or ballooning first and then swapping. But it doesn''t > decide how to get the value X.OK, so if the actor stuff is orthogonal, let''s go back to the original problem. We do want to ensure the solution doesn''t _break_ the actor idea... but IMHO any assumption that there is an actor that can always sufficiently "control" memory allocation is suspect.> This thread has been very hard to follow for some reason, so let me see > if I can understand everything: > * You are concerned about being able to predictably start VMs in the > face of: > - concurrent requests, and > - dynamic memory technologies (including PoD, ballooning, paging, page > sharing, and tmem) > Any of which may change the amount of free memory between the time a > decision is made and the time memory is actually allocated. > * You have proposed a hypervisor-based solution that allows the > toolstack to "reserve" a specific amount of memory to a VM that will not > be used for something else; this allocation is transactional -- it will > either completely succeed, or completely fail, and do it quickly. > > Is that correct?Yes, good summary.> The problem with that solution, it seems to me, is that the hypervisor > does not (and I think probably should not) have any insight into the > policy for allocating or freeing memory as a result of other activities, > such as ballooning or page sharing. Suppose someone were ballooning > down domain M to get 8GiB in order to start domain A; and at some point > , another process looks and says, "Oh look, there''s 4GiB free, that''s > enough to start domain B" and asks Xen to reserve that memory. Xen has > no way of knowing that the memory freed by domain M was "earmarked" for > domain A, and so will happily give it to domain B, causing domain A''s > creation to fail (potentially).I agree completely that the hypervisor shouldn''t have any insight into the _policy_ (though see below). I''m just proposing an extension to the existing mechanism and I am quite convinced that the hypervisor must be involved (e.g. a new hypercall) for the extension to work properly. 
In your example, the "someone" ballooning down domain M to get 8GiB for domain M would need somehow to "reserve" the memory for domain M. I didn''t foresee the use of the proposed reservation mechanism beyond domain creation, but it could probably be used for large ballooning quantities as well.> So it seems like we need to have the idea of a memory controller -- one > central process (per host, as you say) that would know about all of the > knobs -- ballooning, paging, page sharing, tmem, whatever -- that could > be in charge of knowing where all the memory was coming from and where > it was going. So if xl wanted to start a new VM, it can ask the memory > controller for 3GiB, and the controller could decide, "I''ll take 1GiB > from domain M and 2 from domain N, and give it to the new domain", and > respond when it has the memory that it needs. Similarly, it can know > that it should try to keep X megabytes for un-sharing of pages, and it > can be responsible for freeing up more memory if that memory becomes > exhausted.First, let me quibble about the term you used. It''s especially important for you, George, because I know your previous Xen contributions. IMHO, we are not talking about a "memory controller", we are talking about a "memory scheduler". In a CPU scheduler, one would never assume that all demands for CPU time should be reviewed and granted by some userland process in dom0 (and certainly not by some grand central data center manager). That would be silly. Instead, we provide some policy parameters and let each hypervisor make intelligent dynamic decisions thousands of times every second based on those parameters. IMHO, the example you give for asking a memory controller for GiB of memory is equally silly. Outside of some geek with a handful of VMs on a single machine, there is inadequate information from any VM to drive automatic memory allocation decisions and, even if there was, it just doesn''t scale. It doesn''t scale either up, to many VMs across many physical machines, or down, to instantaneous needs of one-page-at-a-time requests for unsharing or for tmem. (Also see my previous comments to Tim about memory-overcommit-by- undercommit: There isn''t sufficient information to size any emergency buffer for unsharing either... too big and you waste memory, too little and it doesn''t solve the underlying problem.)> At the moment, the administrator himself (or the cloud orchestration > layer) needs to be his own memory controller; that is, he needs to > manually decide if there''s enough free memory to start a VM; if there''s > not, he needs to figure out how to get that memory (either by ballooning > or swapping). Ballooning and swapping are both totally under his > control; the only thing he doesn''t control is the unsharing of pages. > But as long as there was a way to tell the page sharing daemon not to > allocate an amount of free memory, then this > "administrator-as-memory-controller" should work just fine. > > Does that make sense? Or am I still confused? :-)It mostly makes sense until you get to host-swapping/unsharing, see comments above. And tmem takes the "doesn''t control" to a whole new level. Meaning tmem (IMHO) completely eliminates the possibility of a "memory controller" and begs for a "memory scheduler". Tmem really is a breakthrough on memory management in a virtualized system. I realize that many people are in the "if it doesn''t work on Windows, I don''t care" camp. 
And others never thought it would make it into upstream Linux (or don''t care because it isn''t completely functional in any distros yet... other than Oracle''s.. but since all parts are now upstream, it will be soon). But there probably are also many that just don''t understand it... I guess I need to work on fixing that. Any thoughts on how to start? In any case, though the reservation proposal is intended to cover tmem as well, I think it is still needed for page-sharing and domain-creation "races". Dan
George Dunlap
2012-Oct-16 11:49 UTC
Re: domain creation vs querying free memory (xend and xl)
On Mon, Oct 8, 2012 at 2:02 AM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:> Tmem really is a breakthrough on memory management in a virtualized > system. I realize that many people are in the "if it doesn''t > work on Windows, I don''t care" camp. And others never thought > it would make it into upstream Linux (or don''t care because it isn''t > completely functional in any distros yet... other than Oracle''s.. > but since all parts are now upstream, it will be soon). But there > probably are also many that just don''t understand it... I guess I need > to work on fixing that. Any thoughts on how to start?

Well, I''m sorry to say this, but to start I think you need to work on your communication. I had read this entire thread 2 or 3 times before writing my last response; and I have now read this e-mail half a dozen times, and I still don''t have a good idea what it is you''re talking about. If I didn''t respect you, I would have just given up on the 2nd try.

In my summary, I mentioned just 2 things: the problem of domain creation, and the solution of a hypercall to allocate a big chunk of memory to a domain. You answered by saying it was a good summary. But then you said:

> I''m just proposing an extension to the > existing mechanism and I am quite convinced that the hypervisor must > be involved (e.g. a new hypercall) for the extension to work properly.

Now you''re talking about an extension... then you mention a "memory scheduler" (which we don''t yet have), and say:

> ...there is inadequate information from > any VM to drive automatic memory allocation decisions and, even if > there was, it just doesn''t scale.

But you don''t say where or who *could* have adequate information; which again hints at something else which you have in mind, but you haven''t actually talked about very explicitly yet. If you have been trying to talk about it, and it wasn''t in my summary, why didn''t you say something about it, instead of saying, "Yes that''s right"? And if you haven''t talked about it, why are you speaking as though we all know already what you''re talking about?

Furthermore, you say things like this:

> IMHO, the example you give for asking a memory controller for GiB > of memory is equally silly. Outside of some geek with a handful > of VMs on a single machine, there is inadequate information from > any VM to drive automatic memory allocation decisions and, even if > there was, it just doesn''t scale. It doesn''t scale either up, to > many VMs across many physical machines, or down, to instantaneous > needs of one-page-at-a-time requests for unsharing or for tmem.

What do you mean, "doesn''t scale up or across"? Why not? Why is there inadequate information inside dom0 for a toolstack-based memory controller? And if there''s not enough information there, who *does* have the information? It''s just a bunch of vague assertions with no justification and no alternative proposed. It doesn''t bring any light to the discussion (which is no doubt why the thread has died without conclusion). Nor does saying "see above" and "see below", when "above" and "below" are still equally unenlightening.

Maybe your grand design for a "memory scheduler", where memory pages hop back and forth at millisecond quanta based on instantaneous data, between page sharing, paging, tmem, and so on, is a good one. But that''s not what we have now. And that''s not even what you''re trying to promote. Instead, you''re trying to push a single hypercall that you think will be necessary for such a scheduler.
Doesn''t it make sense to *first* talk about your grand vision and come up with a reasonable plan for it, *then* propose an implementation? If in the course of your 15-patch series introducing a "memory scheduler", you also introduce a "reservation" hypercall, then everyone can see exactly what it accomplishes, and actually see if it''s necessary, or if some other design would work better. Does that make sense? If I still haven''t understood where you''re coming from, then I am sorry; but I have tried pretty hard, and I''m not the only one having that problem. -George
Dan Magenheimer
2012-Oct-16 17:51 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Mon, Oct 8, 2012 at 2:02 AM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: > > Tmem really is a breakthrough on memory management in a virtualized > > system. I realize that many people are in the "if it doesn''t > > work on Windows, I don''t care" camp. And others never thought > > it would make it into upstream Linux (or don''t care because it isn''t > > completely functional in any distros yet... other than Oracle''s.. > > but since all parts are now upstream, it will be soon). But there > > probably are also many that just don''t understand it... I guess I need > > to work on fixing that. Any thoughts on how to start? > > Well, I''m sorry to say this, but to start I think you need to work on > your communication. I had read this entire thread 2 or 3 times before > writing my last response; and I have now read this e-mail half a dozen > times, and I''m still don''t have a good idea what it is you''re talking > about. If I didn''t respect you, I would have just given up on the 2nd > try. > : > If I still haven''t understood where you''re coming from, then I am > sorry; but I have tried pretty hard, and I''m not the only one having > that problem.Hi George -- Thanks for the honest direct feedback. I had no idea. I have been buried in this memory stuff since April 2008 and it is easy for me to assume that people understand what I am talking about, have read everything I''ve written about it, seen/remember my presentations etc. Further, the conversational delays due to timezone differences and the fact that we all are juggling many different deliverables makes it difficult to maintain all the context necessary to drive/converge a complex discussion. So I am truly sorry and I really appreciate that you''ve stuck with me. Let me ponder how to improve, but try to maintain some forward progress in the interim by continuing this thread. There are two things being mixed here: (A) The very general concepts of how to deal with RAM capacity as a resource and how to best "control" "sharing" of the resource among virtual machines; and (B) how to solve a very specific known problem that occurs due to "races" for memory capacity. Solving (B) requires some assumptions about (A) which is why (A) keeps coming up. I''ll mark my comments below with (A) and (B) to make it clear which is being discussed.> In my summary, I mentioned just 2 things: the problem of domain > creation, and the solution of a hypercall to allocate a big chunk of > memory to a domain. You answered by saying it was a good summary. > But then you said: > > > I''m just proposing an extension to the > > existing mechanism and I am quite convinced that the hypervisor must > > be involved (e.g. a new hypercall) for the extension to work properly. > > Now you''re talking about an extension...This is (B) Extension == new hypercall. (It''s an extension to the way memory has previously been allocated by the hypervisor.)> then you mention a "memory > scheduler" (which we don''t yet have), and say: > > > ...there is inadequate information from > > any VM to drive automatic memory allocation decisions and, even if > > there was, it just doesn''t scale. > > But you don''t say where or who *could* have adequate information; > which again hints at something else which you have in mind, but you > haven''t actually talked about very explicitly yet. 
If you have been > trying to talk about it, and it wasn''t in my summary, why didn''t you > say something about it, instead of saying, "Yes that''s right"? And if > you haven''t talked about it, why are you speaking as though we all > know already what you''re talking about?(A) My bad. The premise of tmem (and IMHO the thorn in the side of all memory capacity management in virtualized systems) is that *nobody* has adequate information. The guest OS has some "demand" information, though not in any externally-communicable form, and the host/hypervisor has "supply" information. Tmem uses a small handful of kernel changes and some hypercalls to tie these together in a surprisingly useful way.> Furthermore, you say things like this: > > > IMHO, the example you give for asking a memory controller for GiB > > of memory is equally silly. Outside of some geek with a handful > > of VMs on a single machine, there is inadequate information from > > any VM to drive automatic memory allocation decisions and, even if > > there was, it just doesn''t scale. It doesn''t scale either up, to > > many VMs across many physical machines, or down, to instantaneous > > needs of one-page-at-a-time requests for unsharing or for tmem. > > What do you mean, "doesn''t scale up or across"? Why not? Why is > there inadequate information inside dom0 for a toolstack-based memory > controller? And if there''s not enough information there, who *does* > have the information? It''s just a bunch of vague assertions with no > justification and no alternative proposed. It doesn''t bring any light > to the discussion (which is no doubt why the thread has died without > conclusion).(A) There is inadequate information period. OS''s have forever been designed to manage a fixed amount of RAM, not to communicate very well about if and when the OS needs more RAM (and how much) or can get by with less RAM (and how much). So any external "memory controller" is (IMHO) doomed to failure, limited to approximations based on pieces of guest-OS-externally-visible usually-out-of-date information collected at a relatively low frequency. Collecting/analyzing/acting-on the information across hundreds/thousands of guests is very difficult (doesn''t "scale up"), collecting/analyzing/acting-on the information across hundreds of machines -- each with hundreds/thousands of guests has exponential communication and bin-packing problems (doesn''t scale "across") and, if the memory-demand is a high-frequency stream of single pages (i.e. with page-unsharing), sampling by the memory controller can''t possibly keep up (doesn''t "scale down"). This is only slightly better than a bunch of vague assertions, but if you disagree, let''s take it down a level in a separate thread. My proposed alternative is tmem. which is why it may appear that I haven''t proposed anything... tmem already exists today.> Nor does saying "see above" and "see below", when "above" and "below" > are still equally unenlightening.Oops, sorry. :-} Just trying to avoid repeating myself.> Maybe your grand designs for a "memory scheduler", where memory pages > hop back and forth at millisecond quanta based on instantaneous data, > between page sharing, paging, tmem, and so on, is a good one. But > that''s not what we have now.(A) Tmem *is* essentially a memory scheduler. A grand design is implemented, works, and all the parts are upstream in open source.> And that''s not even what you''re trying > to promote. 
Instead, you''re trying to push a single hypercall that > you think will be necessary for such a scheduler.(B) Strangely, tmem doesn''t really need this hypercall. It already has a solution working in xm create called "tmem freeze/thaw". But this solution is a half-assed very heavy hammer. The single "memory reservation" hypercall is intended to help solve a known problem (IanJ said early in this thread: "This is a real problem") with any environment where the amount of RAM used by a guest can change dynamically without the knowledge of a not-in-hypervisor "memory controller", and the toolstack then wishes to launch a new domain. The problem can even occur with multiple toolstack threads simultaneously launching domains. After further thought, it appeared that the "memory reservation" hypercall also eliminates the need for the half-assed tmem freeze/thaw as well.> Doesn''t it make sense to *first* talk about your grand vision and come > up with a reasonable plan for it, *then* propose an implementation? > If in the course of your 15-patch series introducing a "memory > scheduler", you also introduce a "reservation" hypercall, then > everyone can see exactly what it accomplishes, and actually see if > it''s necessary, or if some other design would work better. > > Does that make sense?If you reread my last response with the assumption in mind: "tmem == an instance of a memory scheduler == grand vision" then does the discussion of the "memory reservation" hypercall make more sense? Thanks again for the pointed communication feedback. Hopefully this is a bit better and I will continue to ponder more communication improvements. Dan
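To make the race and the proposed fix concrete, here is a minimal sketch, in plain C, of the check-then-allocate pattern versus the claim-up-front transaction being argued for. Nothing below is an existing Xen hypercall: claim_for_domain() and the sizes are hypothetical, and only the accounting is modelled.

/*
 * Minimal sketch of "reserve, then allocate" semantics, modelled purely
 * as userspace accounting.  claim_for_domain()/convert_claim_to_allocation()
 * are hypothetical names, not real Xen or libxl interfaces.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t free_kib = 8u * 1024 * 1024;     /* host free memory (example) */
static uint64_t claimed_kib;                     /* reserved but not yet allocated */

/* Racy pattern: check free memory now, allocate the pages later. */
static bool racy_can_start(uint64_t maxmem_kib)
{
    return free_kib >= maxmem_kib;               /* may be stale by allocation time */
}

/* Proposed pattern: atomically claim the whole maxmem up front. */
static bool claim_for_domain(uint64_t maxmem_kib)
{
    bool ok = false;
    pthread_mutex_lock(&lock);
    if (free_kib - claimed_kib >= maxmem_kib) {
        claimed_kib += maxmem_kib;               /* the claim is the "transaction" */
        ok = true;
    }
    pthread_mutex_unlock(&lock);
    return ok;
}

/* Called as the domain builder actually populates memory. */
static void convert_claim_to_allocation(uint64_t kib)
{
    pthread_mutex_lock(&lock);
    claimed_kib -= kib;
    free_kib -= kib;
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    uint64_t dom_kib = 6u * 1024 * 1024;         /* 6 GiB guest */

    printf("racy check says: %s\n", racy_can_start(dom_kib) ? "yes" : "no");
    printf("claim 1: %s\n", claim_for_domain(dom_kib) ? "granted" : "refused");
    printf("claim 2: %s\n", claim_for_domain(dom_kib) ? "granted" : "refused");
    convert_claim_to_allocation(dom_kib);        /* builder finishes domain 1 */
    return 0;
}

The point of the sketch is only that the check and the reservation must happen under one lock (or one hypercall), so two concurrent domain builds cannot both see the same memory as free.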
George Dunlap
2012-Oct-17 17:35 UTC
Re: domain creation vs querying free memory (xend and xl)
[Sorry, forgot to reply-to-all] On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:> > If you reread my last response with the assumption in mind: > > "tmem == an instance of a memory scheduler == grand vision" > > then does the discussion of the "memory reservation" hypercall > make more sense? Sort of. :-) Unfortunately, I think it shows a bit of confusion, which is perhaps why it was hard to understand. But let''s go back for a minute to the problem at hand: you''re afraid of free memory disappearing between a toolstack checking for the memory, and the toolstack actually creating the VM. There are two ways this could happen: 1. Another admin command (perhaps by another administrator) has caused the memory to go away -- i.e., another admin has called "xl create", or has instructed a VM to balloon up to a higher amount of memory. 2. One of the self-directed processes in the system has allocated the memory: a balloon driver has ballooned up, or the swapper has swapped something in, or the page sharing daemon has had to un-share pages. In the case of #1, I think the right answer to that is, "Don''t do that." :-) The admins should co-ordinate with each other about what to start where; if they both want to use a bit of memory, that''s a human interaction problem, not a technological one. Alternately, if we''re talking a cloud orchestration layer, the cloud orchestration should have an idea how much memory is available on each node, and not allow different users to issue commands which would violate those. In the case of #2, I think the answer is, "self-directed processes should not be allowed to consume free memory without permission from the toolstack". The pager should not increase the memory footprint of a VM unless either told to by an admin or a memory controller which has been given authority by an admin. (Yes, memory controller, not scheduler -- more on that in another e-mail.) A VM should be given a fixed amount of memory above which the balloon driver cannot go. The page-sharing daemon should have a small amount set aside to handle un-sharing requests; but this should be immediately replenished by other methods (preferably by ballooning a VM down, or if necessary by swapping pages out). It should not be able to make arbitrarily large allocations without permission from the toolstack. I was chatting with Konrad yesterday, and he brought up "self-ballooning" VMs, which apparently voluntarily choose to balloon down to *below* their toolstack-dictated balloon target, in order to induce Linux to swap some pages out to tmem, and will then balloon up to the toolstack-dictated target later. It seems to me that the Right Thing in this case is for the toolstack to know that this "free" memory isn''t really free -- that if your 2GiB VM is only using 1.5GiB, you nonetheless don''t touch that 0.5GiB, because you know it may use it later. This is what xapi does. Alternately, if you don''t want to do that accounting, and just want to use Xen''s free memory to determine if you can start a VM, then you could just have your "self-ballooning" processes *not actually free the memory*. That way the free memory would be an accurate representation of how much memory is actually present on a system. In all of this discussion, I don''t see any reason to bring up tmem at all (except to note the reason why a VM may balloon down). It''s just another area to which memory can be allocated (along with Xen or a domain). 
It also should not be allowed to allocate free Xen memory to itself without being specifically instructed by the toolstack, so it can''t cause the problem you''re talking about. Any system that follows the rules I''ve set above won''t have to worry about free memory disappearing half-way through domain creation. I''m not fundamentally opposed to the idea of an "allocate memory to a VM" hypercall; but the arguments adduced to support this seem hopelessly confused, which does not bode well for the usefulness or maintainability of such a hypercall. -George
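As a concrete illustration of the rule above, a minimal sketch of toolstack-side accounting, assuming the libxc call xc_physinfo() for the host view; the reserved_kib ledger is hypothetical toolstack state rather than anything xl or xapi actually keeps, and error handling is pared down.

/*
 * Sketch of the accounting rule described above: a guest that has ballooned
 * below its static maximum still "owns" the difference, so the toolstack
 * subtracts every promised static-max from host memory before deciding
 * whether a new guest fits, instead of trusting the instantaneous
 * free_pages figure.
 */
#include <stdio.h>
#include <xenctrl.h>

/* Sum of static-max for every existing guest plus pending claims (KiB).
 * In a real toolstack this would be derived from domain config records;
 * the value here is an example only. */
static long long reserved_kib = 5LL * 1024 * 1024;

static long long usable_kib_after_reservations(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    xc_physinfo_t info;
    long long host_kib;

    if (!xch)
        return -1;
    if (xc_physinfo(xch, &info)) {
        xc_interface_close(xch);
        return -1;
    }
    xc_interface_close(xch);

    /* Host total in KiB; 4 KiB pages assumed (x86). */
    host_kib = (long long)info.total_pages * 4;

    /* What matters is the total minus everything already promised, not the
     * instantaneous free memory, which existing guests may grow back into. */
    return host_kib > reserved_kib ? host_kib - reserved_kib : 0;
}

int main(void)
{
    long long usable = usable_kib_after_reservations();
    long long want = 2LL * 1024 * 1024;          /* 2 GiB guest */

    if (usable < 0)
        printf("query failed\n");
    else
        printf("%s\n", usable >= want ? "fits" : "does not fit");
    return 0;
}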
George Dunlap
2012-Oct-17 17:35 UTC
Re: domain creation vs querying free memory (xend and xl)
On Wed, Oct 17, 2012 at 6:30 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:> A VM should be given a > fixed amount of memory above which the balloon driver cannot go. I forgot to mention: there is a limit you can set in the hypervisor such that the balloon driver cannot go up past a certain point. And since 4.1, I think, it has been possible to set this limit to below what the VM currently has allocated -- the effect being that as soon as the VM balloons down to that point, it cannot balloon back up. Xapi sets this value at the same time it sets the balloon target in xenstore, so that it can have confidence that once it actually has some free memory, it won''t disappear from under its feet.
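A sketch of that sequence as a toolstack might issue it, assuming the libxc call xc_domain_setmaxmem() and the libxenstore call xs_write(); the domid, target, and slack values are illustrative only.

/*
 * Sketch of the xapi-style sequence: write the balloon target into
 * xenstore and, at the same time, clamp the hypervisor-side maximum so
 * the guest cannot balloon back up once it has come down.
 */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <xenctrl.h>
#include <xenstore.h>

static int set_target_and_clamp(uint32_t domid, uint64_t target_kib)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    struct xs_handle *xsh = xs_open(0);
    char path[64], val[32];
    int rc = -1;

    if (!xch || !xsh)
        goto out;

    /* 1. Tell the balloon driver where to go (KiB, decimal string). */
    snprintf(path, sizeof(path), "/local/domain/%u/memory/target", domid);
    snprintf(val, sizeof(val), "%" PRIu64, target_kib);
    if (!xs_write(xsh, XBT_NULL, path, val, strlen(val)))
        goto out;

    /* 2. Clamp the hypervisor maximum to the same point, plus a little
     * slack for guest overheads (the slack here is made up). */
    if (xc_domain_setmaxmem(xch, domid, target_kib + 1024))
        goto out;

    rc = 0;
out:
    if (xsh) xs_close(xsh);
    if (xch) xc_interface_close(xch);
    return rc;
}

int main(void)
{
    return set_target_and_clamp(1, 1024 * 1024) ? 1 : 0;   /* dom 1 -> 1 GiB */
}

Whether max_pages (or a future "current_allowance" field) is the right knob for this is exactly the point debated below.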
Andres Lagar-Cavilla
2012-Oct-17 18:33 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 17, 2012, at 1:35 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:> [Sorry, forgot to reply-to-all] > > On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: >> >> If you reread my last response with the assumption in mind: >> >> "tmem == an instance of a memory scheduler == grand vision" >> >> then does the discussion of the "memory reservation" hypercall >> make more sense? > > Sort of. :-) Unfortunately, I think it shows a bit of confusion, which > is perhaps why it was hard to understand. > > But let''s go back for a minute to the problem at hand: you''re afraid > of free memory disappearing between a toolstack checking for the > memory, and the toolstack actually creating the VM. > > There are two ways this could happen: > > 1. Another admin command (perhaps by another administrator) has caused > the memory to go away -- i.e,. another admin has called "xl create", > or has instructed a VM to balloon up to a higher amount of memory. > > 2. One of the self-directed processes in the system has allocated the > memory: a balloon driver has ballooned up, or the swapper has swapped > something in, or the page sharing daemon has had to un-share pages. > > In the case of #1, I think the right answer to that is, "Don''t do > that." :-) The admins should co-ordinate with each other about what > to start where; if they both want to use a bit of memory, that''s a > human interaction problem, not a technological one. Alternately, if > we''re talking a cloud orchestration layer, the cloud orchestration > should have an idea how much memory is available on each node, and not > allow different users to issue commands which would violate those. > > In the case of #2, I think the answer is, "self-directed processes > should not be allowed to consume free memory without permission from > the toolstack". The pager should not increase the memory footprint of > a VM unless either told to by an admin or a memory controller which > has been given authority by an admin. (Yes, memory controller, not > scheduler -- more on that in another e-mail.) A VM should be given a > fixed amount of memory above which the balloon driver cannot go. The > page-sharing daemon should have a small amount set aside to handle > un-sharing requests; but this should be immediately replenished by > other methods (preferably by ballooning a VM down, or if necessary by > swapping pages out). It should not be able to make arbitrarily large > allocations without permission from the toolstack.Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack manages. Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If anything we should change the hypervisor to have a "current_allowance" or similar field with a more obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all known methods can be bounded in terms of their capacity to allocate memory unsupervised. Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the heap. The toolstack ensures there is something set apart. 
I further think the pod cache could be converted to this model. Why have specific per-domain lists of cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled requirement of certain toolstack features. But allows to throw away a lot of complex code. My two cents for the new iteration Andres> > I was chatting with Konrad yesterday, and he brought up > "self-ballooning" VMs, which apparently vonluntarily choose to balloon > down to *below* their toolstack-dictated balloon target, in order to > induce Linux to swap some pages out to tmem, and will then balloon up to > the toolstack-dictated target later. > > It seems to me that the Right Thing in this case is for the toolstack > to know that this "free" memory isn''t really free -- that if your 2GiB > VM is only using 1.5GiB, you nonetheless don''t touch that 0.5GiB, > because you know it may use it later. This is what xapi does. > > Alternately, if you don''t want to do that accounting, and just want to > use Xen''s free memory to determine if you can start a VM, then you > could just have your "self-ballooning" processes *not actually free > the memory*. That way the free memory would be an accurate > representation of how much memory is actually present on a system. > > In all of this discussion, I don''t see any reason to bring up tmem at > all (except to note the reason why a VM may balloon down). It''s just > another area to which memory can be allocated (along with Xen or a > domain). It also should not be allowed to allocate free Xen memory to > itself without being specifically instructed by the toolstack, so it can''t > cause the problem you''re talking about. > > Any system that follows the rules I''ve set above won''t have to worry > about free memory disappearing half-way through domain creation. > > I''m not fundamentally opposed to the idea of an "allocate memory to a > VM" hypercall; but the arguments adduced to support this seem > hopelessly confused, which does not bode well for the usefulness or > maintainability of such a hypercall. > > -George
Dan Magenheimer
2012-Oct-17 18:45 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: > > > > If you reread my last response with the assumption in mind: > > > > "tmem == an instance of a memory scheduler == grand vision" > > > > then does the discussion of the "memory reservation" hypercall > > make more sense? > > Sort of. :-) Unfortunately, I think it shows a bit of confusion, which > is perhaps why it was hard to understand. > : > I''m not fundamentally opposed to the idea of an "allocate memory to a > VM" hypercall; but the arguments adduced to support this seem > hopelessly confused, which does not bode well for the usefulness or > maintainability of such a hypercall.Hi George -- Now I think I have a better idea as to why you are not understanding and why you think this is confusing!!! It seems we are not only speaking different languages but are from completely different planets! I.e. our world views are very very different. You have a very very static/partitioned/restrictive/controlled view of how memory should be managed in a virtual environment. I have a very very dynamic view of how memory should be managed in a virtual environment. Tmem -- and the ability to change guest kernels to cooperate in dynamic memory management -- very obviously drives my world view, but my view reveals subtle deficiencies in your world view. Xapi and the constraints it lives under (i.e. requirement for proprietary HVM guest kernels) and the existing Xapi memory controller model seems good enough for you, so your view makes my need for handling subtle dynamic corner cases appear that I must have some secret fantastical "grand design" in mind.> Any system that follows the rules I''ve set above won''t have to worry > about free memory disappearing half-way through domain creation.Agreed. My claim is that: (1) tmem can''t possibly follow your rules as it would decrease its value/performance by several orders of magnitude; (2) page-unsharing/swapping can''t possibly follow your rules because the corner cases it must deal with are urgent, frequent, and unpredictable; (3) a "cloud orchestration layer" can''t follow your rules because of complexity and communication limits, unless it greatly constrains its flexibility/automation; (4) following your rules serializes common administration activities even for Xapi that otherwise don''t need to be serialized. I think your rules take an overconstrained problem (managing memory for multiple VMs) and add more constraints. While IMHO tmem takes away constraints. That''s why I brought up CPU schedulers. I know you are an expert in CPU scheduling, and you would never apply similar rules to CPU scheduling that you want to apply to "memory scheduling". E.g. you would never require the toolstack to be in the critical path for every VCPU->CPU reassignment. And so I have to try to solve a problem that you don''t have (or IMHO that you will likely have in the future but don''t admit to yet ;-) And I think the "reservation" hypercall will solve that problem.> In all of this discussion, I don''t see any reason to bring up tmem at > all (except to note the reason why a VM may balloon down). It''s just > another area to which memory can be allocated (along with Xen or a > domain). 
It also should not be allowed to allocate free Xen memory to > itself without being specifically instructed by the toolstack, so it can''t > cause the problem you''re talking about.This is all very wrong. It''s clear you don''t understand why tmem exists, how it works, and what its value is/can be in the cloud. I''ll take some of the blame for that because I''ve had to spend so much time in Linux-kernel land in the last couple of years. But if you want to try a different world view, and understand tmem, let me know ;-) I don''t mean to be immodest, but I truly believe it is the first significant advance in managing RAM in a virtual environment in ten years (since Waldspurger). Dan
Dan Magenheimer
2012-Oct-17 19:46 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)Hi Andres -- Re reply just sent to George... I think you must be on a third planet, revolving somewhere between George''s and mine. I say that because I agree completely with some of your statements and disagree with the conclusions you draw from them! :-)> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If > anything we should change the hypervisor to have a "current_allowance" or similar field with a more > obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness.Me cringes too. Though I can see from George''s view that it makes perfect sense. Since the toolstack always controls exactly how much memory is assigned to a domain and since it can cache the "original max", current allowance and the hypervisors view of max_pages must always be the same. Only if the hypervisor or the domain or the domain''s administrator can tweak current memory usage without the knowledge of the toolstack (which is closer to my planet) does an issue arise. And, to me, that''s the foundation of this whole thread.> Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all > known methods can be bounded in terms of their capacity to allocate memory unsupervised. > Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the > heap. The toolstack ensures there is something set apart.By "ex machina" do you mean "without the toolstack''s knowledge"? Then how does page-unsharing work? Does every page-unshare done by the hypervisor require serial notification/permission of the toolstack? Or is this "batched", in which case a pool is necessary, isn''t it? (Not sure what you mean by "no need for a pool" and then "toolstack ensures there is something set apart"... what''s the difference?) My point is, whether there is no pool or a pool that sometimes runs dry, are you really going to put the toolstack in the hypervisor''s path for allocating a page so that the hypervisor can allocate a new page for CoW to fulfill an unshare?> Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect > of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely > have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack > manages.In a perfect world where the toolstack has an oracle for the precise time-varying memory requirements for all guests, I would agree. In that world, there''s no need for a CPU scheduler either... the toolstack can decide exactly when to assign each VCPU for each VM onto each PCPU, and when to stop and reassign. And then every PCPU would be maximally utilized, right? My point: Why would you resource-manage CPUs differently from memory? The demand of real-world workloads varies dramatically for both... don''t you want both to be managed dynamically, whenever possible? If yes (dynamic is good), in order for the toolstack''s view of memory to be non-racy, doesn''t every hypervisor page allocation need to be serialized with the toolstack granting notification/permission?> I further think the pod cache could be converted to this model. Why have specific per-domain lists of > cached pages in the hypervisor? Get them back from the heap! 
Obviously places a decoupled requirement > of certain toolstack features. But allows to throw away a lot of complex code.IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") the heap is "always" empty because the toolstack has assigned all memory. So I''m still confused... where does "page unshare" get memory from and how does it notify and/or get permission from the toolstack?> My two cents for the new iterationI''ll see your two cents, and raise you a penny! ;-) Dan
Andres Lagar-Cavilla
2012-Oct-17 20:14 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 17, 2012, at 3:46 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > Hi Andres -- > > Re reply just sent to George... > > I think you must be on a third planet, revolving somewhere between > George''s and mine. I say that because I agree completely with some > of your statements and disagree with the conclusions you draw from > them! :-) > >> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If >> anything we should change the hypervisor to have a "current_allowance" or similar field with a more >> obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. > > Me cringes too. Though I can see from George''s view that it makes > perfect sense. Since the toolstack always controls exactly how > much memory is assigned to a domain and since it can cache the > "original max", current allowance and the hypervisors view of > max_pages must always be the same.No. There is room for slack. max_pages (or current_allowance) simply sets an upper bound, which if met will trigger the need for memory management intervention.> > Only if the hypervisor or the domain or the domain''s administrator > can tweak current memory usage without the knowledge of the > toolstack (which is closer to my planet) does an issue arise. > And, to me, that''s the foundation of this whole thread. > >> Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all >> known methods can be bounded in terms of their capacity to allocate memory unsupervised. >> Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the >> heap. The toolstack ensures there is something set apart. > > By "ex machina" do you mean "without the toolstack''s knowledge"? > > Then how does page-unsharing work? Does every page-unshare done by > the hypervisor require serial notification/permission of the toolstack?No of course not. But if you want to keep a domain at bay you keep its max_pages where you want it to stop growing. And at that point the domain will fall asleep (not 100% there hypervisor-wise yet but Real Soon Now (™)), and a synchronous notification will be sent to a listener. At that point it''s again a memory management decision. Should I increase the domain''s reservation, page something out, etc? There is a range of possibilities that are not germane to the core issue of enforcing memory limits.> Or is this "batched", in which case a pool is necessary, isn''t it? > (Not sure what you mean by "no need for a pool" and then "toolstack > ensures there is something set apart"... what''s the difference?)I am under the impression there is a proposal floating for a hypervisor-maintained pool of pages to immediately relief un-sharing. Much like there is now for PoD (the pod cache). This is what I think is not necessary.> > My point is, whether there is no pool or a pool that sometimes > runs dry, are you really going to put the toolstack in the hypervisor''s > path for allocating a page so that the hypervisor can allocate > a new page for CoW to fulfill an unshare?Absolutely not.> >> Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect >> of the discussion we''ve had so far. I just don''t see that. 
The toolstack has (or should definitely >> have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack >> manages. > > In a perfect world where the toolstack has an oracle for the > precise time-varying memory requirements for all guests, I > would agree.With the mechanism outlined, the toolstack needs to make coarse-grained infrequent decisions. There is a possibility for pathological misbehavior -- I think there is always that possibility. Correctness is preserved, at worst, performance will be hurt. It''s really important to keep things separate in this discussion. The toolstack+hypervisor are enabling (1) control over how memory is allocated to what (2) control over a domain''s ability to grow its footprint unsupervised (3) control over a domain''s footprint with PV mechanisms from within, or externally. Performance is not up to the toolstack but to the memory manager magic the toolstack enables with (3).> > In that world, there''s no need for a CPU scheduler either... > the toolstack can decide exactly when to assign each VCPU for > each VM onto each PCPU, and when to stop and reassign. > And then every PCPU would be maximally utilized, right? > > My point: Why would you resource-manage CPUs differently from > memory? The demand of real-world workloads varies dramatically > for both... don''t you want both to be managed dynamically, > whenever possible? > > If yes (dynamic is good), in order for the toolstack''s view of > memory to be non-racy, doesn''t every hypervisor page allocation > need to be serialized with the toolstack granting notification/permission?Once you bucketize RAM and know you will get synchronous kicks as buckets fill up, then you have a non-racy view. If you choose buckets of width one…..> >> I further think the pod cache could be converted to this model. Why have specific per-domain lists of >> cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled requirement >> of certain toolstack features. But allows to throw away a lot of complex code. > > IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") > the heap is "always" empty because the toolstack has assigned all memory.I don''t think that''s what they mean. Nor is it what I mean. The toolstack may chunk memory up into abstract buckets. It can certainly assert that its bucketized view matches the hypervisor view. Pages flow from the heap to each domain -- but the bucket "domain X" will not overflow unsupervised.> So I''m still confused... where does "page unshare" get memory from > and how does it notify and/or get permission from the toolstack?Re sharing, as it should be clear by now, the answer is "it doesn''t matter". If unsharing cannot be satisfied form the heap, then memory management in dom0 is invoked. Heavy-weight, but it means you''ve hit an admin-imposed limit. Please note that this notion of limits and enforcement is sparingly applied today, to the best of my knowledge. But imho it''d be great to meaningfully work towards it. Andres> >> My two cents for the new iteration > > I''ll see your two cents, and raise you a penny! ;-) > > Dan
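A sketch of the listener model described above: the event source below is a stub standing in for whatever synchronous notification mechanism the hypervisor provides when a domain hits its cap, the policy numbers are invented, and the real actions (raising max_pages, ballooning another guest down, paging) are reduced to printf placeholders.

/*
 * Sketch of "bucketized" growth: a guest that bumps into its cap blocks
 * until a supervisor process decides whether to grant another bucket from
 * the heap or reclaim memory elsewhere.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cap_event { uint32_t domid; uint64_t cur_kib; uint64_t cap_kib; };

static uint64_t host_free_kib = 512 * 1024;      /* example: 512 MiB spare */

/* Stub: a real toolstack would block on the hypervisor notification
 * channel until some domain hits its cap. */
static bool next_cap_event(struct cap_event *ev)
{
    static int fired;
    if (fired++) return false;
    *ev = (struct cap_event){ .domid = 3, .cur_kib = 2048 * 1024,
                              .cap_kib = 2048 * 1024 };
    return true;
}

static void raise_cap(uint32_t domid, uint64_t new_cap_kib)
{   /* would call xc_domain_setmaxmem() or equivalent here */
    printf("dom%u: cap raised to %llu KiB\n", domid,
           (unsigned long long)new_cap_kib);
}

static void reclaim_elsewhere(uint64_t kib)
{   /* would balloon another guest down, or start paging/swapping */
    printf("reclaiming %llu KiB from other guests\n", (unsigned long long)kib);
}

int main(void)
{
    struct cap_event ev;
    const uint64_t step_kib = 256 * 1024;        /* grow in 256 MiB buckets */

    while (next_cap_event(&ev)) {
        if (host_free_kib >= step_kib) {
            host_free_kib -= step_kib;           /* grant from the heap */
            raise_cap(ev.domid, ev.cap_kib + step_kib);
        } else {
            reclaim_elsewhere(step_kib);         /* heavy-weight fallback */
        }
    }
    return 0;
}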
Dan Magenheimer
2012-Oct-17 22:07 UTC
Re: domain creation vs querying free memory (xend and xl)
Hi Andres -- First, the primary target of page-sharing is HVM proprietary/legacy guests, correct? So, as I said, we are starting from different planets. I''m not arguing that a toolstack-memory-controller won''t be sufficient for your needs, especially in a single server environment, only that the work required to properly ensure that:> >> The toolstack has (or should definitely have) a non-racy view > >> of the memory of the hostis unnecessary if you (and the toolstack) take a slightly broader dynamic view of memory management. IMHO that broader view (which requires the "memory reservation" hypercall) both encompasses tmem and IMHO greatly simplifies memory management in the presence of page-unsharing. I.e. it allows the toolstack to NOT have a non-racy view of the memory of the host. So, if you don''t mind, I will take this opportunity to ask some questions about page-sharing stuff, in the context of the toolstack-memory-controller and/or memory reservation hypercall.> >> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If > >> anything we should change the hypervisor to have a "current_allowance" or similar field with a more > >> obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. > > > > Me cringes too. Though I can see from George''s view that it makes > > perfect sense. Since the toolstack always controls exactly how > > much memory is assigned to a domain and since it can cache the > > "original max", current allowance and the hypervisors view of > > max_pages must always be the same. > > No. There is room for slack. max_pages (or current_allowance) simply sets an upper bound, which if met > will trigger the need for memory management intervention.I think we agree if we change my "must always be the same" to "must always be essentially the same, ignoring some fudge factor". Which begs the questions: How does one determine how big the fudge factor is, what happens if it is not big enough, and if it is too big, doesn''t that potentially add up to a lot of wasted space?> > By "ex machina" do you mean "without the toolstack''s knowledge"? > > > > Then how does page-unsharing work? Does every page-unshare done by > > the hypervisor require serial notification/permission of the toolstack? > > No of course not. But if you want to keep a domain at bay you keep its max_pages where you want it to > stop growing. And at that point the domain will fall asleep (not 100% there hypervisor-wise yet but > Real Soon Now (T)), and a synchronous notification will be sent to a listener. > > At that point it''s again a memory management decision. Should I increase the domain''s reservation, > page something out, etc? There is a range of possibilities that are not germane to the core issue of > enforcing memory limits.Maybe we need to dive deep into page-sharing accounting for a moment here: When a page is shared say, by 1000 different VMs, does it get "billed" to all VMs? If no (which makes the most sense to me), how is the toolstack informed that there is now 999 free pages available so that it can use them in, say, a new domain? Does the hypervisor notification wait until there is sufficient pages (say, a bucket''s worth)? If yes, what''s the point of sharing if the hypervisor now has some free memory but the the freed memory is still "billed"; and are there data structures in the hypervisor to track this so that unsharing does proper accounting too? 
Now suppose 10000 pages are shared by 1000 different VMs at domain launch (scenario: an online class is being set up by a cloud user) and then the VMs suddenly get very active and require a lot of CoWing (say the online class just got underway). What''s the profile of interaction between the hypervisor and toolstack? Maybe you''ve got this all figured out (whether implemented or not) and are convinced it is scalable (or don''t care because the target product is a small single system), but I''d imagine the internal hypervisor vs toolstack accounting/notifications will get very very messy and have concerns about scalability and memory waste.> > Or is this "batched", in which case a pool is necessary, isn''t it? > > (Not sure what you mean by "no need for a pool" and then "toolstack > > ensures there is something set apart"... what''s the difference?) > > I am under the impression there is a proposal floating for a hypervisor-maintained pool of pages to > immediately relief un-sharing. Much like there is now for PoD (the pod cache). This is what I think is > not necessary.I agree it is not necessary, but don''t understand who manages the "slop" (unallocated free pages) and how a pool is different from a "bucket" (to use your term from further down in your reply).> > My point is, whether there is no pool or a pool that sometimes > > runs dry, are you really going to put the toolstack in the hypervisor''s > > path for allocating a page so that the hypervisor can allocate > > a new page for CoW to fulfill an unshare? > > Absolutely not.Good to hear. But this begs answers to the previous questions. Mainly: How does it all work then so that the toolstack and hypervisor are "in sync" about the number of available pages such that the toolstack never wrongly determines that there is enough free space to launch a domain and (by the time it tries to use the free space) there really isn''t? If they can''t remain in sync (at least within a single "bucket", across the entire system, not one bucket per domain), then isn''t something like the proposed "memory reservation" hypercall still required?> >> Something that I struggle with here is the notion that we need to extend the hypervisor for any > aspect > >> of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely > >> have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack > >> manages. > > > > In a perfect world where the toolstack has an oracle for the > > precise time-varying memory requirements for all guests, I > > would agree. > > With the mechanism outlined, the toolstack needs to make coarse-grained infrequent decisions. There is > a possibility for pathological misbehavior -- I think there is always that possibility. Correctness is > preserved, at worst, performance will be hurt.IMHO, performance will be hurt not only for the pathological cases. Memory will also needlessly be wasted. But, for Windows, I don''t have a better solution, and it will probably be no worse than Microsoft''s solution.> It''s really important to keep things separate in this discussion. The toolstack+hypervisor are > enabling (1) control over how memory is allocated to what (2) control over a domain''s ability to grow > its footprint unsupervised (3) control over a domain''s footprint with PV mechanisms from within, or > externally. 
> > Performance is not up to the toolstack but to the memory manager magic the toolstack enables with (3).Good dichotomy (though not entirely perfect on my planet).> > In that world, there''s no need for a CPU scheduler either... > > the toolstack can decide exactly when to assign each VCPU for > > each VM onto each PCPU, and when to stop and reassign. > > And then every PCPU would be maximally utilized, right? > > > > My point: Why would you resource-manage CPUs differently from > > memory? The demand of real-world workloads varies dramatically > > for both... don''t you want both to be managed dynamically, > > whenever possible? > > > > If yes (dynamic is good), in order for the toolstack''s view of > > memory to be non-racy, doesn''t every hypervisor page allocation > > need to be serialized with the toolstack granting notification/permission? > > Once you bucketize RAM and know you will get synchronous kicks as buckets fill up, then you have a > non-racy view. If you choose buckets of width one...... e.g. tmem, which is saving one page of data at high frequency> >> I further think the pod cache could be converted to this model. Why have specific per-domain lists > of > >> cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled > requirement > >> of certain toolstack features. But allows to throw away a lot of complex code. > > > > IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") > > the heap is "always" empty because the toolstack has assigned all memory. > > I don''t think that''s what they mean. Nor is it what I mean. The toolstack may chunk memory up into > abstract buckets. It can certainly assert that its bucketized view matches the hypervisor view. Pages > flow from the heap to each domain -- but the bucket "domain X" will not overflow unsupervised.Right, but it is the "underflow" I am concerned with. I don''t know if that is what they mean by "balloon-to-fit" (or exactly what you mean), but I think we are all trying to optimize the use of a fixed amount of RAM among some number of VMs. To me, a corollary of that is that the size of the heap is always as small "as possible". And another corollary is that there aren''t a bunch of empty pools of free pages lying about waiting for rare events to happen. And one more corollary is that, to the extent possible, guests aren''t "wasting" memory.> > So I''m still confused... where does "page unshare" get memory from > > and how does it notify and/or get permission from the toolstack? > > Re sharing, as it should be clear by now, the answer is "it doesn''t matter". If unsharing cannot be > satisfied form the heap, then memory management in dom0 is invoked. Heavy-weight, but it means you''ve > hit an admin-imposed limit.Well it *does* matter if that fallback (unsharing cannot be satisfied from the heap) happens too frequently.> Please note that this notion of limits and enforcement is sparingly applied today, to the best of my > knowledge. But imho it''d be great to meaningfully work towards it.Agreed. There''s lots of policy questions around all of our different mechanism "planets", so I hope this discussion meaningfully helps! Thanks for the great discussion! Dan
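For scale, the arithmetic behind the sharing scenario posed above (10,000 pages shared by 1,000 VMs, 4 KiB pages assumed):

/* Back-of-the-envelope numbers for the hypothetical scenario above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t page_kib = 4, shared_pages = 10000, guests = 1000;

    uint64_t mapped   = shared_pages * guests;         /* page mappings */
    uint64_t physical = shared_pages;                  /* backing pages */
    uint64_t saved    = mapped - physical;             /* freed by sharing */

    printf("mappings: %llu, backing pages: %llu\n",
           (unsigned long long)mapped, (unsigned long long)physical);
    printf("nominally freed: %llu pages = %.1f GiB\n",
           (unsigned long long)saved,
           (double)saved * page_kib / (1024.0 * 1024.0));
    /* Worst case, CoW faults can demand all of that back; the open
     * question is whose budget those pages come out of, and when the
     * toolstack finds out. */
    return 0;
}

Roughly 38 GiB is nominally freed in this example, and in the worst case un-sharing can demand all of it back, which is exactly the accounting question left open in the thread.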