Dan Magenheimer
2012-Oct-29 17:06 UTC
Proposed new "memory capacity claim" hypercall/feature
Keir, Jan (et al) --

In a recent long thread [1], there was a great deal of discussion about the possible need for a "memory reservation" hypercall. While there was some confusion due to the two worldviews of static vs dynamic management of physical memory capacity, one worldview definitely has a requirement for this new capability. It is still uncertain whether the other worldview will benefit as well, though I believe it eventually will, especially when page sharing is fully deployed.

Note that to avoid confusion with existing usages of various terms (such as "reservation"), I am now using the distinct word "claim", as in a "land claim" or "mining claim": http://dictionary.cambridge.org/dictionary/british/stake-a-claim
When a toolstack creates a domain, it can first "stake a claim" to the amount of memory capacity necessary to ensure the domain launch will succeed.

In order to explore feasibility, I wanted to propose a possible hypervisor design and would very much appreciate feedback!

The objective of the design is to ensure that a multi-threaded toolstack can atomically claim a specific amount of RAM capacity for a domain, especially in the presence of independent dynamic memory demand (such as tmem and selfballooning) which the toolstack is not able to track. "Claim X 50G" means that, on completion of the call, either (A) 50G of capacity has been claimed for use by domain X and the call returns success, or (B) the call returns failure. Note that in the above, "claim" explicitly does NOT mean that specific physical RAM pages have been assigned, only that the 50G of RAM capacity is not available either to a subsequent "claim" or for most[2] independent dynamic memory demands.

I think the underlying hypervisor issue is that the current process of "reserving" memory capacity (which currently does assign specific physical RAM pages) is, by necessity when used for large quantities of RAM, batched and slow and, consequently, can NOT be atomic. One way to think of the newly proposed "claim" is as "lazy reserving": the capacity is set aside even though specific physical RAM pages have not been assigned. Another way to think of it is that claiming is really just an accounting illusion, similar to how an accountant must "accrue" future liabilities.

Hypervisor design/implementation overview:

A domain currently does RAM accounting with two primary counters, "tot_pages" and "max_pages". (For now, let's ignore shr_pages, paged_pages, and xenheap_pages, and I hope Olaf/Andre/others can provide further expertise and input.) tot_pages is a struct domain field in the hypervisor that tracks the number of physical RAM pageframes "owned" by the domain. The hypervisor enforces that tot_pages is never allowed to exceed another struct domain field called max_pages.

I would like to introduce a new counter, which records how much capacity is claimed for a domain but may or may not yet be mapped to physical RAM pageframes. To do so, I'd like to split the concept of tot_pages into two variables, tot_phys_pages and tot_claimed_pages, and require the hypervisor to also enforce:

  d.tot_phys_pages <= d.tot_claimed_pages[3] <= d.max_pages

I'd also split the hypervisor global "total_avail_pages" into "total_free_pages" and "total_unclaimed_pages". (I'm definitely going to need to study more the two-dimensional array "avail"...) The hypervisor must now do additional accounting to keep track of the sum of claims across all domains and also enforce the global:

  total_unclaimed_pages <= total_free_pages

I think the memory_op hypercall can be extended to add two additional subops, XENMEM_claim and XENMEM_release. (Note: To support tmem, there will need to be two variations of XENMEM_claim, "hard claim" and "soft claim" [3].) The XENMEM_claim subop atomically evaluates total_unclaimed_pages against the new claim, claims the pages for the domain if possible, and returns success or failure. XENMEM_release "unsets" the domain's tot_claimed_pages (to an "illegal" value such as zero or MINUS_ONE).

The hypervisor must also enforce some semantics: If an allocation occurs such that a domain's tot_phys_pages would equal or exceed d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset". This enforces the temporary nature of a claim: Once a domain fully "occupies" its claim, the claim silently expires. In the case of a dying domain, a XENMEM_release operation is implied and must be executed by the hypervisor.

Ideally, the quantity of unclaimed memory for each domain and for the system should be query-able. This may require additional memory_op hypercalls.

I'd very much appreciate feedback on this proposed design!

Thanks,
Dan

[1] http://lists.xen.org/archives/html/xen-devel/2012-09/msg02229.html
    and continued in October (the archives don't thread across months)
    http://lists.xen.org/archives/html/xen-devel/2012-10/msg00080.html
[2] Pages used to store tmem "ephemeral" data may be an exception because those pages are "free-on-demand".
[3] I'd be happy to explain the minor additional work necessary to support tmem but have mostly left it out of the proposal for clarity.
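For concreteness, the accounting described above is small enough to sketch in a few dozen lines of C. This is an illustrative sketch only, not Xen code: it borrows the proposed names (tot_phys_pages, tot_claimed_pages, total_free_pages, total_unclaimed_pages, XENMEM_claim/XENMEM_release), while the lock (a pthread mutex standing in for Xen's heap lock), the CLAIM_UNSET sentinel, and the error handling are assumptions made purely for the example.

    /* Illustrative sketch of the proposed claim accounting -- not Xen code. */
    #include <errno.h>
    #include <pthread.h>

    #define CLAIM_UNSET 0UL                  /* assumed sentinel: no active claim */

    struct domain_acct {
        unsigned long tot_phys_pages;        /* pages actually assigned to the domain */
        unsigned long tot_claimed_pages;     /* claimed capacity, or CLAIM_UNSET */
        unsigned long max_pages;
    };

    /* total_free_pages is maintained by the page allocator elsewhere; the
     * design's global invariant is total_unclaimed_pages <= total_free_pages. */
    static unsigned long total_free_pages;
    static unsigned long total_unclaimed_pages;  /* free pages not covered by any claim */
    static pthread_mutex_t acct_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for the heap lock */

    /* XENMEM_claim: O(1) arithmetic only -- no specific page frames are selected.
     * For simplicity this assumes the domain has no claim currently active. */
    int claim_pages(struct domain_acct *d, unsigned long nr_pages)
    {
        int rc = -ENOMEM;

        pthread_mutex_lock(&acct_lock);
        if (nr_pages <= d->max_pages &&
            nr_pages >= d->tot_phys_pages &&
            nr_pages - d->tot_phys_pages <= total_unclaimed_pages) {
            /* set aside capacity: these free pages can no longer satisfy
             * other claims or independent dynamic allocations */
            total_unclaimed_pages -= nr_pages - d->tot_phys_pages;
            d->tot_claimed_pages = nr_pages;
            rc = 0;
        }
        pthread_mutex_unlock(&acct_lock);
        return rc;
    }

    /* XENMEM_release: also implied when the domain dies or fully occupies its
     * claim.  (The allocator would additionally decrement total_unclaimed_pages
     * when giving a page to a domain that holds no claim.) */
    void release_claim(struct domain_acct *d)
    {
        pthread_mutex_lock(&acct_lock);
        if (d->tot_claimed_pages != CLAIM_UNSET)
            total_unclaimed_pages += d->tot_claimed_pages - d->tot_phys_pages;
        d->tot_claimed_pages = CLAIM_UNSET;
        pthread_mutex_unlock(&acct_lock);
    }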
Keir Fraser
2012-Oct-29 18:24 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 29/10/2012 18:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:> The objective of the design is to ensure that a multi-threaded > toolstack can atomically claim a specific amount of RAM capacity for a > domain, especially in the presence of independent dynamic memory demand > (such as tmem and selfballooning) which the toolstack is not able to track. > "Claim X 50G" means that, on completion of the call, either (A) 50G of > capacity has been claimed for use by domain X and the call returns > success or (B) the call returns failure. Note that in the above, > "claim" explicitly does NOT mean that specific physical RAM pages have > been assigned, only that the 50G of RAM capacity is not available either > to a subsequent "claim" or for most[2] independent dynamic memory demands.I don''t really understand the problem it solves, to be honest. Why would you not just allocate the RAM pages, rather than merely making that amount of memory unallocatable for any other purpose? -- Keir
Dan Magenheimer
2012-Oct-29 21:08 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 29/10/2012 18:06, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
> > The objective of the design is to ensure that a multi-threaded
> > toolstack can atomically claim a specific amount of RAM capacity for a
> > domain, especially in the presence of independent dynamic memory demand
> > (such as tmem and selfballooning) which the toolstack is not able to track.
> > "Claim X 50G" means that, on completion of the call, either (A) 50G of
> > capacity has been claimed for use by domain X and the call returns
> > success or (B) the call returns failure.  Note that in the above,
> > "claim" explicitly does NOT mean that specific physical RAM pages have
> > been assigned, only that the 50G of RAM capacity is not available either
> > to a subsequent "claim" or for most[2] independent dynamic memory demands.
>
> I don't really understand the problem it solves, to be honest. Why would you
> not just allocate the RAM pages, rather than merely making that amount of
> memory unallocatable for any other purpose?

Hi Keir --

Thanks for the response!

Sorry, I guess the answer to your question is buried in the thread
referenced (as [1]) plus a vague mention in this proposal.

The core issue is that, in the hypervisor, every current method of
"allocating RAM" is slow enough that if you want to allocate millions
of pages (e.g. for a large domain), the total RAM can't be allocated
atomically.  In fact, it may even take minutes, so currently a large
allocation is explicitly preemptible, not atomic.

The problems the proposal solves are (1) some toolstacks (including
Oracle's "cloud orchestration layer") want to launch domains in parallel;
currently xl/xapi require launches to be serialized which isn't very
scalable in a large data center; and (2) tmem and/or other dynamic
memory mechanisms may be asynchronously absorbing small-but-significant
portions of RAM for other purposes during an attempted domain launch.

In either case, this is a classic race, and a large allocation may
unexpectedly fail, possibly even after several minutes, which is
unacceptable for a data center operator or for automated tools trying
to launch any very large domain.

Does that make sense?  I'm very open to other solutions, but the
only one I've heard so far was essentially "disallow independent
dynamic memory allocations" plus keep track of all "claiming" in the
toolstack.

Thanks,
Dan
Keir Fraser
2012-Oct-29 22:22 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:> The core issue is that, in the hypervisor, every current method of > "allocating RAM" is slow enough that if you want to allocate millions > of pages (e.g. for a large domain), the total RAM can''t be allocated > atomically. In fact, it may even take minutes, so currently a large > allocation is explicitly preemptible, not atomic. > > The problems the proposal solves are (1) some toolstacks (including > Oracle''s "cloud orchestration layer") want to launch domains in parallel; > currently xl/xapi require launches to be serialized which isn''t very > scalable in a large data center;Well it does depend how scalable domain creation actually is as an operation. If it is spending most of its time allocating memory then it is quite likely that parallel creations will spend a lot of time competing for the heap spinlock, and actually there will be little/no speedup compared with serialising the creations. Further, if domain creation can take minutes, it may be that we simply need to go optimise that -- we already found one stupid thing in the heap allocator recently that was burining loads of time during large-memory domain creations, and fixed it for a massive speedup in that particular case.> and (2) tmem and/or other dynamic > memory mechanisms may be asynchronously absorbing small-but-significant > portions of RAM for other purposes during an attempted domain launch.This is an argument against allocate-rather-than-reserve? I don''t think that makes sense -- so is this instead an argument against reservation-as-a-toolstack-only-mechanism? I''m not actually convinced yet we need reservations *at all*, before we get down to where it should be implemented. -- Keir> In either case, this is a classic race, and a large allocation may > unexpectedly fail, possibly even after several minutes, which is > unacceptable for a data center operator or for automated tools trying > to launch any very large domain. > > Does that make sense? I''m very open to other solutions, but the > only one I''ve heard so far was essentially "disallow independent > dynamic memory allocations" plus keep track of all "claiming" in the > toolstack.
Tim Deegan
2012-Oct-29 22:35 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:

> Hypervisor design/implementation overview:
>
> A domain currently does RAM accounting with two primary counters,
> "tot_pages" and "max_pages".  (For now, let's ignore shr_pages,
> paged_pages, and xenheap_pages, and I hope Olaf/Andre/others can
> provide further expertise and input.)
>
> tot_pages is a struct domain field in the hypervisor that tracks
> the number of physical RAM pageframes "owned" by the domain.  The
> hypervisor enforces that tot_pages is never allowed to exceed another
> struct domain field called max_pages.
>
> I would like to introduce a new counter, which records how
> much capacity is claimed for a domain but may or may not yet be
> mapped to physical RAM pageframes.  To do so, I'd like to split
> the concept of tot_pages into two variables, tot_phys_pages and
> tot_claimed_pages, and require the hypervisor to also enforce:
>
>   d.tot_phys_pages <= d.tot_claimed_pages[3] <= d.max_pages
>
> I'd also split the hypervisor global "total_avail_pages" into
> "total_free_pages" and "total_unclaimed_pages".  (I'm definitely
> going to need to study more the two-dimensional array "avail"...)
> The hypervisor must now do additional accounting to keep track
> of the sum of claims across all domains and also enforce the
> global:
>
>   total_unclaimed_pages <= total_free_pages
>
> I think the memory_op hypercall can be extended to add two
> additional subops, XENMEM_claim and XENMEM_release.  (Note: To
> support tmem, there will need to be two variations of XENMEM_claim,
> "hard claim" and "soft claim" [3].)  The XENMEM_claim subop atomically
> evaluates total_unclaimed_pages against the new claim, claims
> the pages for the domain if possible, and returns success or failure.
> XENMEM_release "unsets" the domain's tot_claimed_pages (to an
> "illegal" value such as zero or MINUS_ONE).
>
> The hypervisor must also enforce some semantics: If an allocation
> occurs such that a domain's tot_phys_pages would equal or exceed
> d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> This enforces the temporary nature of a claim: Once a domain
> fully "occupies" its claim, the claim silently expires.

Why does that happen?  If I understand you correctly, releasing the
claim is something the toolstack should do once it knows it's no longer
needed.

> In the case of a dying domain, a XENMEM_release operation
> is implied and must be executed by the hypervisor.
>
> Ideally, the quantity of unclaimed memory for each domain and
> for the system should be query-able.  This may require additional
> memory_op hypercalls.
>
> I'd very much appreciate feedback on this proposed design!

As I said, I'm not opposed to this, though even after reading through
the other thread I'm not convinced that it's necessary (except in cases
where guest-controlled operations are allowed to consume unbounded
memory, which frankly gives me the heebie-jeebies).

I think it needs a plan for handling restricted memory allocations.
For example, some PV guests need their memory to come below a
certain machine address, or entirely in superpages, and certain
build-time allocations come from xenheap.  How would you handle that
sort of thing?

Cheers,

Tim.
Dan Magenheimer
2012-Oct-29 23:03 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Keir Fraser [mailto:keir@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
> > The core issue is that, in the hypervisor, every current method of
> > "allocating RAM" is slow enough that if you want to allocate millions
> > of pages (e.g. for a large domain), the total RAM can't be allocated
> > atomically.  In fact, it may even take minutes, so currently a large
> > allocation is explicitly preemptible, not atomic.
> >
> > The problems the proposal solves are (1) some toolstacks (including
> > Oracle's "cloud orchestration layer") want to launch domains in parallel;
> > currently xl/xapi require launches to be serialized which isn't very
> > scalable in a large data center;
>
> Well it does depend how scalable domain creation actually is as an
> operation. If it is spending most of its time allocating memory then it is
> quite likely that parallel creations will spend a lot of time competing for
> the heap spinlock, and actually there will be little/no speedup compared
> with serialising the creations. Further, if domain creation can take
> minutes, it may be that we simply need to go optimise that -- we already
> found one stupid thing in the heap allocator recently that was burning
> loads of time during large-memory domain creations, and fixed it for a
> massive speedup in that particular case.

I suppose ultimately it is a scalability question.  But Oracle's
measure of success here is based on how long a human or a tool
has to wait for confirmation to ensure that a domain will
successfully launch.  If two domains are launched in parallel
AND an indication is given that both will succeed, spinning on
the heaplock a bit just makes for a longer "boot" time, which is
just a cost of virtualization.  If they are launched in parallel
and, minutes later (or maybe even 20 seconds later), one or
both say "oops, I was wrong, there wasn't enough memory, so
try again", that's not OK for data center operations, especially if
there really was enough RAM for one, but not for both.  Remember,
in the Oracle environment, we are talking about an administrator/automation
overseeing possibly hundreds of physical servers, not just a single
user/server.

Does that make more sense?

The "claim" approach immediately guarantees success or failure.
Unless there are enough "stupid things/optimisations" found that
you would be comfortable putting memory allocation for a domain
creation inside a hypervisor spinlock, there will be a race unless
an atomic mechanism exists, such as "claiming", where
only simple arithmetic must be done within a hypervisor lock.

Do you disagree?

> > and (2) tmem and/or other dynamic
> > memory mechanisms may be asynchronously absorbing small-but-significant
> > portions of RAM for other purposes during an attempted domain launch.
>
> This is an argument against allocate-rather-than-reserve? I don't think that
> makes sense -- so is this instead an argument against
> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
> need reservations *at all*, before we get down to where it should be
> implemented.

I'm not sure if we are defining terms the same, so that's hard
to answer.  If you define "allocation" as "a physical RAM page frame
number is selected (and possibly the physical page is zeroed)",
then I'm not sure how your definition of "reservation" differs
(because that's how increase/decrease_reservation are implemented
in the hypervisor, right?).

Or did you mean "allocate-rather-than-claim" (where "allocate" means
select a specific physical pageframe and "claim" means do accounting
only)?  If so, see the atomicity argument above.

I'm not just arguing against reservation-as-a-toolstack-mechanism,
I'm stating I believe unequivocally that reservation-as-a-toolstack-
only-mechanism and tmem are incompatible.  (Well, not _totally_
incompatible... the existing workaround, tmem freeze/thaw, works
but is also single-threaded and has fairly severe unnecessary
performance repercussions.  So I'd like to solve both problems
at the same time.)

Dan
Keir Fraser
2012-Oct-29 23:17 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 30/10/2012 00:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:>> From: Keir Fraser [mailto:keir@xen.org] >> Subject: Re: Proposed new "memory capacity claim" hypercall/feature >> >> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote: >> >> Well it does depend how scalable domain creation actually is as an >> operation. If it is spending most of its time allocating memory then it is >> quite likely that parallel creations will spend a lot of time competing for >> the heap spinlock, and actually there will be little/no speedup compared >> with serialising the creations. Further, if domain creation can take >> minutes, it may be that we simply need to go optimise that -- we already >> found one stupid thing in the heap allocator recently that was burining >> loads of time during large-memory domain creations, and fixed it for a >> massive speedup in that particular case. > > I suppose ultimately it is a scalability question. But Oracle''s > measure of success here is based on how long a human or a tool > has to wait for confirmation to ensure that a domain will > successfully launch. If two domains are launched in parallel > AND an indication is given that both will succeed, spinning on > the heaplock a bit just makes for a longer "boot" time, which is > just a cost of virtualization. If they are launched in parallel > and, minutes later (or maybe even 20 seconds later), one or > both say "oops, I was wrong, there wasn''t enough memory, so > try again", that''s not OK for data center operations, especially if > there really was enough RAM for one, but not for both. Remember, > in the Oracle environment, we are talking about an administrator/automation > overseeing possibly hundreds of physical servers, not just a single > user/server. > > Does that make more sense?Yes, that makes sense.> The "claim" approach immediately guarantees success or failure. > Unless there are enough "stupid things/optimisations" found that > you would be comfortable putting memory allocation for a domain > creation in a hypervisor spinlock, there will be a race unless > an atomic mechanism exists such as "claiming" where > only simple arithmetic must be done within a hypervisor lock. > > Do you disagree? > >>> and (2) tmem and/or other dynamic >>> memory mechanisms may be asynchronously absorbing small-but-significant >>> portions of RAM for other purposes during an attempted domain launch. >> >> This is an argument against allocate-rather-than-reserve? I don''t think that >> makes sense -- so is this instead an argument against >> reservation-as-a-toolstack-only-mechanism? I''m not actually convinced yet we >> need reservations *at all*, before we get down to where it should be >> implemented. > > I''m not sure if we are defining terms the same, so that''s hard > to answer. If you define "allocation" as "a physical RAM page frame > number is selected (and possibly the physical page is zeroed)", > then I''m not sure how your definition of "reservation" differs > (because that''s how increase/decrease_reservation are implemented > in the hypervisor, right?). > > Or did you mean "allocate-rather-than-claim" (where "allocate" is > select a specific physical pageframe and "claim" means do accounting > only? If so, see the atomicity argument above. > > I''m not just arguing against reservation-as-a-toolstack-mechanism, > I''m stating I believe unequivocally that reservation-as-a-toolstack- > only-mechanism and tmem are incompatible. (Well, not _totally_ > incompatible... 
the existing workaround, tmem freeze/thaw, works > but is also single-threaded and has fairly severe unnecessary > performance repercussions. So I''d like to solve both problems > at the same time.)Okay, so why is tmem incompatible with implementing claims in the toolstack? -- Keir> Dan
Dan Magenheimer
2012-Oct-29 23:21 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Monday, October 29, 2012 4:36 PM
> To: Dan Magenheimer
> Cc: Keir (Xen.org); Jan Beulich; George Dunlap; Olaf Hering; Ian Campbell;
> Konrad Wilk; xen-devel@lists.xen.org; George Shuklin; Dario Faggioli;
> Kurt Hackel; Ian Jackson; Zhigang Wang; Mukesh Rathor
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> > The hypervisor must also enforce some semantics: If an allocation
> > occurs such that a domain's tot_phys_pages would equal or exceed
> > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > This enforces the temporary nature of a claim: Once a domain
> > fully "occupies" its claim, the claim silently expires.
>
> Why does that happen?  If I understand you correctly, releasing the
> claim is something the toolstack should do once it knows it's no longer
> needed.

Hi Tim --

Thanks for the feedback!

I haven't thought this all the way through yet, but I think this
part of the design allows the toolstack to avoid monitoring the
domain until "total_phys_pages" reaches "total_claimed" pages,
which should make the implementation of claims in the toolstack
simpler, especially in many-server environments.

> > In the case of a dying domain, a XENMEM_release operation
> > is implied and must be executed by the hypervisor.
> >
> > Ideally, the quantity of unclaimed memory for each domain and
> > for the system should be query-able.  This may require additional
> > memory_op hypercalls.
> >
> > I'd very much appreciate feedback on this proposed design!
>
> As I said, I'm not opposed to this, though even after reading through
> the other thread I'm not convinced that it's necessary (except in cases
> where guest-controlled operations are allowed to consume unbounded
> memory, which frankly gives me the heebie-jeebies).

A really detailed discussion of tmem would probably be good but, yes,
with tmem, guest-controlled* operations can and frequently will
absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
memory used by the hypervisor on behalf of domains, not domain-owned
memory.

* "guest-controlled" I suspect is the heebie-jeebie word... in
  tmem, a better description might be "guest-controls-which-data-
  and-hypervisor-controls-how-many-pages"

> I think it needs a plan for handling restricted memory allocations.
> For example, some PV guests need their memory to come below a
> certain machine address, or entirely in superpages, and certain
> build-time allocations come from xenheap.  How would you handle that
> sort of thing?

Good point.  I think there's always been some uncertainty about
how to account for different zones and xenheap... are they part of the
domain's memory or not?

Deserves some more thought... if you can enumerate all such cases,
that would be very helpful (and probably valuable long-term
documentation as well).

Thanks,
Dan
Tim Deegan
2012-Oct-30 08:13 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
Hi,

At 16:21 -0700 on 29 Oct (1351527686), Dan Magenheimer wrote:

> > > The hypervisor must also enforce some semantics: If an allocation
> > > occurs such that a domain's tot_phys_pages would equal or exceed
> > > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > > This enforces the temporary nature of a claim: Once a domain
> > > fully "occupies" its claim, the claim silently expires.
> >
> > Why does that happen?  If I understand you correctly, releasing the
> > claim is something the toolstack should do once it knows it's no longer
> > needed.
>
> I haven't thought this all the way through yet, but I think this
> part of the design allows the toolstack to avoid monitoring the
> domain until "total_phys_pages" reaches "total_claimed" pages,
> which should make the implementation of claims in the toolstack
> simpler, especially in many-server environments.

I think the toolstack has to monitor the domain for that long anyway,
since it will have to unpause it once it's built.  Relying on an
implicit release seems fragile -- if the builder ends up using only
(total_claimed - 1) pages, or temporarily allocating total_claimed and
then releasing some memory, things could break.

> > I think it needs a plan for handling restricted memory allocations.
> > For example, some PV guests need their memory to come below a
> > certain machine address, or entirely in superpages, and certain
> > build-time allocations come from xenheap.  How would you handle that
> > sort of thing?
>
> Good point.  I think there's always been some uncertainty about
> how to account for different zones and xenheap... are they part of the
> domain's memory or not?

Xenheap pages are not part of the domain memory for accounting purposes;
likewise other 'anonymous' allocations (that is, anywhere that
alloc_domheap_pages() & friends are called with a NULL domain pointer).
Pages with restricted addresses are just accounted like any other
memory, except when they're on the free lists.

Today, toolstacks use a rule of thumb of how much extra space to leave
to cover those things -- if you want to pre-allocate them, you'll have
to go through the hypervisor making sure _all_ memory allocations are
accounted to the right domain somehow (maybe by generalizing the
shadow-allocation pool to cover all per-domain overheads).  That seems
like a useful side-effect of adding your new feature.

> Deserves some more thought... if you can enumerate all such cases,
> that would be very helpful (and probably valuable long-term
> documentation as well).

I'm afraid I can't, not without re-reading all the domain-builder code
and a fair chunk of the hypervisor, so it's up to you to figure it out.

Tim.
Jan Beulich
2012-Oct-30 08:29 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 30.10.12 at 00:21, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Tim Deegan [mailto:tim@xen.org]
>> As I said, I'm not opposed to this, though even after reading through
>> the other thread I'm not convinced that it's necessary (except in cases
>> where guest-controlled operations are allowed to consume unbounded
>> memory, which frankly gives me the heebie-jeebies).
>
> A really detailed discussion of tmem would probably be good but,
> yes, with tmem, guest-controlled* operations can and frequently will
> absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
> memory used by the hypervisor on behalf of domains, not domain-owned
> memory.
>
> * "guest-controlled" I suspect is the heebie-jeebie word... in
> tmem, a better description might be "guest-controls-which-data-
> and-hypervisor-controls-how-many-pages"

But isn't tmem use supposed to be transparent in this respect, i.e.
if a "normal" allocation cannot be satisfied, tmem would jump in
and free sufficient space?  In which case there's no need to do
any accounting outside of the control tools (leaving aside the
smaller hypervisor internal allocations, which the tool stack needs
to provide room for anyway).

Jan
George Dunlap
2012-Oct-30 09:11 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On Mon, Oct 29, 2012 at 6:06 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:

> Keir, Jan (et al) --
>
> In a recent long thread [1], there was a great deal of discussion
> about the possible need for a "memory reservation" hypercall.
> While there was some confusion due to the two worldviews of static
> vs dynamic management of physical memory capacity, one worldview
> definitely has a requirement for this new capability.

No, it does not.

> I'm not just arguing against reservation-as-a-toolstack-mechanism,
> I'm stating I believe unequivocally that reservation-as-a-toolstack-
> only-mechanism and tmem are incompatible.  (Well, not _totally_
> incompatible... the existing workaround, tmem freeze/thaw, works
> but is also single-threaded and has fairly severe unnecessary
> performance repercussions.  So I'd like to solve both problems
> at the same time.)

No, it is not.

Look, the *only* reason you have this problem is that *you yourselves*
programmed in two incompatible assumptions:

1. You have a toolstack that assumes it can ask "how much free memory
   is there" from the HV and have that be an accurate answer, rather
   than keeping track of this itself

2. You wrote the tmem code to do "self-ballooning", which for no good
   reason, gives memory back to the hypervisor, rather than just
   keeping it itself.

Basically #2 breaks the assumption of #1.  It has absolutely nothing at
all to do with tmem.  It's just a quirk of your particular
implementation of self-ballooning.

This new hypercall you're introducing is just a hack to fix the fact
that you've baked in incompatible assumptions.  It's completely
unnecessary.  All of the functionality you're describing can be
implemented outside of the hypervisor in the toolstack -- this would
fix #1.  Doing that would have no effect on tmem whatsoever.
Alternately, you could fix #2 -- have the "self-ballooning" mechanism
just allocate the memory to force the swapping to happen, but *not
hand it back to the hypervisor*.

We don't need this new hypercall.  You should just fix your own bugs
rather than introducing new hacks to work around them.

 -George
Keir Fraser
2012-Oct-30 14:43 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 30/10/2012 16:13, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:>> Okay, so why is tmem incompatible with implementing claims in the toolstack? > > (Hmmm... maybe I could schedule the equivalent of a PhD qual exam > for tmem with all the core Xen developers as examiners?) > > The short answer is tmem moves memory capacity around far too > frequently to be managed by a userland toolstack, especially if > the "controller" lives on a central "manager machine" in a > data center (Oracle''s model). The ebb and flow of memory supply > and demand for each guest is instead managed entirely dynamically.I don''t know. I agree that fine-grained memory management is the duty of the hypervisor, but it seems to me that the toolstack should be able to handle admission control. It knows how much memory each existing guest is allowed to consume at max, how much memory the new guest requires, how much memory the system has total... Isn''t the decision then simple? Tmem should be fairly invisible to the toolstack, right? -- Keir> The somewhat longer answer (and remember all of this is > implemented and upstream in Xen and Linux today): > > First, in the tmem model, each guest is responsible for driving > its memory utilization (what Xen tools calls "current" and Xen > hypervisor calls "tot_pages") as low as it can. This is done > in Linux with selfballooning. At 50Hz (default), the guest > kernel will attempt to expand or contract the balloon to match > the guest kernel''s current demand for memory. Agreed, one guest > requesting changes at 50Hz could probably be handled by > a userland toolstack, but what about 100 guests? Maybe... > but there''s more. > > Second, in the tmem model, each guest is making tmem hypercalls > at a rate of perhaps thousands per second, driven by the kernel > memory management internals. Each call deals with a single > page of memory and each possibly may remove a page from (or > return a page to) Xen''s free list. Interacting with a userland > toolstack for each page is simply not feasible for this high > of a frequency, even in a single guest. > > Third, tmem in Xen implements both compression and deduplication > so each attempt to put a page of data from the guest into > the hypervisor may or may not require a new physical page. > Only the hypervisor knows. > > So, even on a single machine, tmem is tossing memory capacity > about at a very very high frequency. A userland toolstack can''t > possibly keep track, let alone hope to control it; that would > entirely defeat the value of tmem. It would be like requiring > the toolstack to participate in every vcpu->pcpu transition > in the Xen cpu scheduler. > > Does that make sense and answer your question? > > Anyway, I think the proposed "claim" hypercall/subop neatly > solves the problem of races between large-chunk memory demands > (i.e. large domain launches) and small-chunk memory demands > (i.e. small domain launches and single-page tmem allocations).
Dan Magenheimer
2012-Oct-30 15:13 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 30/10/2012 00:03, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
> >> From: Keir Fraser [mailto:keir@xen.org]
> >> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> >>
> >> On 29/10/2012 21:08, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> >>
> >> Well it does depend how scalable domain creation actually is as an
> >> operation. If it is spending most of its time allocating memory then it is
> >> quite likely that parallel creations will spend a lot of time competing for
> >> the heap spinlock, and actually there will be little/no speedup compared
> >> with serialising the creations. Further, if domain creation can take
> >> minutes, it may be that we simply need to go optimise that -- we already
> >> found one stupid thing in the heap allocator recently that was burning
> >> loads of time during large-memory domain creations, and fixed it for a
> >> massive speedup in that particular case.
> >
> > I suppose ultimately it is a scalability question.  But Oracle's
> > measure of success here is based on how long a human or a tool
> > has to wait for confirmation to ensure that a domain will
> > successfully launch.  If two domains are launched in parallel
> > AND an indication is given that both will succeed, spinning on
> > the heaplock a bit just makes for a longer "boot" time, which is
> > just a cost of virtualization.  If they are launched in parallel
> > and, minutes later (or maybe even 20 seconds later), one or
> > both say "oops, I was wrong, there wasn't enough memory, so
> > try again", that's not OK for data center operations, especially if
> > there really was enough RAM for one, but not for both.  Remember,
> > in the Oracle environment, we are talking about an administrator/automation
> > overseeing possibly hundreds of physical servers, not just a single
> > user/server.
> >
> > Does that make more sense?
>
> Yes, that makes sense.

:)  So, not to beat a dead horse, but let me re-emphasize that
the problem exists even without considering tmem.  I wish to solve
the problem, but would like to do it in a way which also resolves
a similar problem for tmem.  I think the "claim" approach does that.

> > The "claim" approach immediately guarantees success or failure.
> > Unless there are enough "stupid things/optimisations" found that
> > you would be comfortable putting memory allocation for a domain
> > creation inside a hypervisor spinlock, there will be a race unless
> > an atomic mechanism exists, such as "claiming", where
> > only simple arithmetic must be done within a hypervisor lock.
> >
> > Do you disagree?
> >
> >>> and (2) tmem and/or other dynamic
> >>> memory mechanisms may be asynchronously absorbing small-but-significant
> >>> portions of RAM for other purposes during an attempted domain launch.
> >>
> >> This is an argument against allocate-rather-than-reserve? I don't think that
> >> makes sense -- so is this instead an argument against
> >> reservation-as-a-toolstack-only-mechanism? I'm not actually convinced yet we
> >> need reservations *at all*, before we get down to where it should be
> >> implemented.
> >
> > I'm not sure if we are defining terms the same, so that's hard
> > to answer.  If you define "allocation" as "a physical RAM page frame
> > number is selected (and possibly the physical page is zeroed)",
> > then I'm not sure how your definition of "reservation" differs
> > (because that's how increase/decrease_reservation are implemented
> > in the hypervisor, right?).
> >
> > Or did you mean "allocate-rather-than-claim" (where "allocate" means
> > select a specific physical pageframe and "claim" means do accounting
> > only)?  If so, see the atomicity argument above.
> >
> > I'm not just arguing against reservation-as-a-toolstack-mechanism,
> > I'm stating I believe unequivocally that reservation-as-a-toolstack-
> > only-mechanism and tmem are incompatible.  (Well, not _totally_
> > incompatible... the existing workaround, tmem freeze/thaw, works
> > but is also single-threaded and has fairly severe unnecessary
> > performance repercussions.  So I'd like to solve both problems
> > at the same time.)
>
> Okay, so why is tmem incompatible with implementing claims in the toolstack?

(Hmmm... maybe I could schedule the equivalent of a PhD qual exam
for tmem with all the core Xen developers as examiners?)

The short answer is tmem moves memory capacity around far too
frequently to be managed by a userland toolstack, especially if
the "controller" lives on a central "manager machine" in a
data center (Oracle's model).  The ebb and flow of memory supply
and demand for each guest is instead managed entirely dynamically.

The somewhat longer answer (and remember all of this is
implemented and upstream in Xen and Linux today):

First, in the tmem model, each guest is responsible for driving
its memory utilization (what the Xen tools call "current" and the Xen
hypervisor calls "tot_pages") as low as it can.  This is done
in Linux with selfballooning.  At 50Hz (default), the guest
kernel will attempt to expand or contract the balloon to match
the guest kernel's current demand for memory.  Agreed, one guest
requesting changes at 50Hz could probably be handled by
a userland toolstack, but what about 100 guests?  Maybe...
but there's more.

Second, in the tmem model, each guest is making tmem hypercalls
at a rate of perhaps thousands per second, driven by the kernel
memory management internals.  Each call deals with a single
page of memory and each may possibly remove a page from (or
return a page to) Xen's free list.  Interacting with a userland
toolstack for each page is simply not feasible for this high
of a frequency, even in a single guest.

Third, tmem in Xen implements both compression and deduplication,
so each attempt to put a page of data from the guest into
the hypervisor may or may not require a new physical page.
Only the hypervisor knows.

So, even on a single machine, tmem is tossing memory capacity
about at a very very high frequency.  A userland toolstack can't
possibly keep track, let alone hope to control it; that would
entirely defeat the value of tmem.  It would be like requiring
the toolstack to participate in every vcpu->pcpu transition
in the Xen cpu scheduler.

Does that make sense and answer your question?

Anyway, I think the proposed "claim" hypercall/subop neatly
solves the problem of races between large-chunk memory demands
(i.e. large domain launches) and small-chunk memory demands
(i.e. small domain launches and single-page tmem allocations).

Thanks,
Dan
Dan Magenheimer
2012-Oct-30 15:26 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> Hi,

Hi Tim!

> At 16:21 -0700 on 29 Oct (1351527686), Dan Magenheimer wrote:
> > > > The hypervisor must also enforce some semantics: If an allocation
> > > > occurs such that a domain's tot_phys_pages would equal or exceed
> > > > d.tot_claimed_pages, then d.tot_claimed_pages becomes "unset".
> > > > This enforces the temporary nature of a claim: Once a domain
> > > > fully "occupies" its claim, the claim silently expires.
> > >
> > > Why does that happen?  If I understand you correctly, releasing the
> > > claim is something the toolstack should do once it knows it's no longer
> > > needed.
> >
> > I haven't thought this all the way through yet, but I think this
> > part of the design allows the toolstack to avoid monitoring the
> > domain until "total_phys_pages" reaches "total_claimed" pages,
> > which should make the implementation of claims in the toolstack
> > simpler, especially in many-server environments.
>
> I think the toolstack has to monitor the domain for that long anyway,
> since it will have to unpause it once it's built.

Could be.  This "claim auto-expire" feature is certainly not a
requirement but I thought it might be useful, especially for
multi-server toolstacks (such as Oracle's).  I may take a look at
implementing it anyway since it is probably only a few lines of code,
but will ensure I do so as a separately reviewable/rejectable patch.

> Relying on an
> implicit release seems fragile -- if the builder ends up using only
> (total_claimed - 1) pages, or temporarily allocating total_claimed and
> then releasing some memory, things could break.

I agree it's fragile, though I don't see how things could actually
"break".  But, let's drop claim-auto-expire for now as I fear it is
detracting from the larger discussion.

> > > I think it needs a plan for handling restricted memory allocations.
> > > For example, some PV guests need their memory to come below a
> > > certain machine address, or entirely in superpages, and certain
> > > build-time allocations come from xenheap.  How would you handle that
> > > sort of thing?
> >
> > Good point.  I think there's always been some uncertainty about
> > how to account for different zones and xenheap... are they part of the
> > domain's memory or not?
>
> Xenheap pages are not part of the domain memory for accounting purposes;
> likewise other 'anonymous' allocations (that is, anywhere that
> alloc_domheap_pages() & friends are called with a NULL domain pointer).
> Pages with restricted addresses are just accounted like any other
> memory, except when they're on the free lists.
>
> Today, toolstacks use a rule of thumb of how much extra space to leave
> to cover those things -- if you want to pre-allocate them, you'll have
> to go through the hypervisor making sure _all_ memory allocations are
> accounted to the right domain somehow (maybe by generalizing the
> shadow-allocation pool to cover all per-domain overheads).  That seems
> like a useful side-effect of adding your new feature.

Hmmm... then I'm not quite sure how adding a simple "claim" changes
the need for accounting of these anonymous allocations.  I guess it
depends on the implementation... maybe the simple implementation I
have in mind can't co-exist with anonymous allocations, but I think
it will.

> > Deserves some more thought... if you can enumerate all such cases,
> > that would be very helpful (and probably valuable long-term
> > documentation as well).
>
> I'm afraid I can't, not without re-reading all the domain-builder code
> and a fair chunk of the hypervisor, so it's up to you to figure it out.

Well, or at least to ensure that I haven't made it any worse ;-)

me adds "world peace" to the requirements list for the new claim hypercall ;-)

Thanks much for the feedback!
Dan
Dan Magenheimer
2012-Oct-30 15:43 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, October 30, 2012 2:29 AM
> To: Dan Magenheimer
> Cc: Olaf Hering; Ian Campbell; George Dunlap; Ian Jackson; George Shuklin;
> Dario Faggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel;
> Mukesh Rathor; Zhigang Wang; Keir (Xen.org); Tim Deegan
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 30.10.12 at 00:21, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: Tim Deegan [mailto:tim@xen.org]
> >> As I said, I'm not opposed to this, though even after reading through
> >> the other thread I'm not convinced that it's necessary (except in cases
> >> where guest-controlled operations are allowed to consume unbounded
> >> memory, which frankly gives me the heebie-jeebies).
> >
> > A really detailed discussion of tmem would probably be good but,
> > yes, with tmem, guest-controlled* operations can and frequently will
> > absorb ALL physical RAM.  However, this is "freeable" (ephemeral)
> > memory used by the hypervisor on behalf of domains, not domain-owned
> > memory.
> >
> > * "guest-controlled" I suspect is the heebie-jeebie word... in
> > tmem, a better description might be "guest-controls-which-data-
> > and-hypervisor-controls-how-many-pages"
>
> But isn't tmem use supposed to be transparent in this respect, i.e.
> if a "normal" allocation cannot be satisfied, tmem would jump in
> and free sufficient space?  In which case there's no need to do
> any accounting outside of the control tools (leaving aside the
> smaller hypervisor internal allocations, which the tool stack needs
> to provide room for anyway).

Hi Jan --

Tmem can only "free sufficient space" up to the total amount of
ephemeral space of which it has control (i.e. all "freeable" memory).

Let me explain further: Let's oversimplify a bit and say that there
are three types of pages:

a) Truly free memory (each free page is on the hypervisor free list)
b) Freeable memory ("ephemeral" memory managed by tmem)
c) Owned memory (pages allocated by the hypervisor or for a domain)

The sum of these three is always a constant: the total number of RAM
pages in the system.  However, when tmem is active, the values of all
_three_ of these change constantly.  So if, at the start of a domain
launch, the sum of free+freeable exceeds the intended size of the
domain, the domain allocation/launch can start.  But then if "owned"
increases enough, there may no longer be enough memory and the domain
launch will fail.

With tmem, memory "owned" by a domain (d.tot_pages) increases dynamically
in two ways: selfballooning and persistent puts (aka frontswap),
but is always capped by d.max_pages.  Neither of these communicates
to the toolstack.

Similarly, tmem (or selfballooning) may be dynamically freeing up lots
of memory without communicating to the toolstack, which could result in
the toolstack rejecting a domain launch believing there is insufficient
memory.

I am thinking the "claim" hypercall/subop eliminates these problems
and hope you agree!

Thanks,
Dan
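A small worked example of the free/freeable/owned accounting above, with numbers invented purely for illustration, on a 256GB machine:

    free     = 100 GB   (pages on the hypervisor free list)
    freeable =  60 GB   (tmem ephemeral pages, releasable on demand)
    owned    =  96 GB   (pages allocated to domains or the hypervisor)
    -------------------
    total    = 256 GB   (constant)

A 120GB launch may begin, since free + freeable = 160GB >= 120GB. If, during the minutes the allocation takes, existing guests selfballoon up or do persistent tmem puts so that "owned" grows by 50GB, free + freeable drops to 110GB and the launch fails part-way through. A 120GB claim taken up front would instead have either succeeded immediately, setting that capacity aside against the 50GB of growth, or failed immediately, before any page was allocated.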
Jan Beulich
2012-Oct-30 16:04 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 30.10.12 at 16:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> With tmem, memory "owned" by a domain (d.tot_pages) increases dynamically
> in two ways: selfballooning and persistent puts (aka frontswap),
> but is always capped by d.max_pages.  Neither of these communicates
> to the toolstack.
>
> Similarly, tmem (or selfballooning) may be dynamically freeing up lots
> of memory without communicating to the toolstack, which could result in
> the toolstack rejecting a domain launch believing there is insufficient
> memory.
>
> I am thinking the "claim" hypercall/subop eliminates these problems
> and hope you agree!

With tmem being the odd one here, wouldn't it make more sense
to force it into no-alloc mode (apparently not exactly the same as
freezing all pools) for the (infrequent?) time periods of domain
creation, thus not allowing the amount of free memory to drop
unexpectedly?  Tmem could, during these time periods, still itself
internally recycle pages (e.g. fulfill a persistent put by discarding
an ephemeral page).

Jan
Dan Magenheimer
2012-Oct-30 16:13 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com]
> :
> No, it does not.
> :
> No, it does not.
> :
> We don't need this new hypercall.  You should just fix your own bugs
> rather than introducing new hacks to work around them.

Ouch.  I'm sorry if the previous discussion on this made you angry.
I wasn't sure if you were just absorbing the new information or
rejecting it or just too busy to reply, so decided to proceed with a
more specific proposal.  I wasn't intending to cut off the discussion.

New paradigms and paradigm shifts always encounter resistance,
especially from those with a lot of investment in the old paradigm.
This "new" paradigm, tmem, has been in Xen for years now and the final
piece is now in upstream Linux as well.  Tmem is in many ways a
breakthrough in virtualized memory management, though admittedly it is
far from perfect (and, notably, will not help proprietary or legacy
guests).  I would hope you, as release manager, would either try to
understand the different paradigm or at least accept that there are
different paradigms than yours that can co-exist in an open source
project.

To answer some of your points: Dynamic handling of memory management
is not a bug.  And selfballooning is only a small (though important)
part of the tmem story.  And the Oracle "toolstack" manages hundreds
of physical machines and thousands of virtual machines across a
physical network, not one physical machine with a handful of virtual
machines across Xenbus.  So we come from different perspectives.

As repeatedly pointed out (and confirmed by others), variations of
the memory "race" problem exist even without tmem.  I do agree that
if a toolstack insists that only it, the toolstack, can ever allocate
or free memory, the problem goes away.  You think that restriction is
reasonable, and I think it is not.

The "claim" proposal is very simple and (as far as I can tell so far)
shouldn't interfere with your paradigm.  Reinforcing your paradigm by
rejecting the proposal only cripples my paradigm.  Please ensure you
don't reject a proposal simply because you have a different worldview.

Thanks,
Dan
Dan Magenheimer
2012-Oct-30 16:33 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Keir Fraser [mailto:keir.xen@gmail.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 30/10/2012 16:13, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>
> >> Okay, so why is tmem incompatible with implementing claims in the toolstack?
> >
> > (Hmmm... maybe I could schedule the equivalent of a PhD qual exam
> > for tmem with all the core Xen developers as examiners?)
> >
> > The short answer is tmem moves memory capacity around far too
> > frequently to be managed by a userland toolstack, especially if
> > the "controller" lives on a central "manager machine" in a
> > data center (Oracle's model).  The ebb and flow of memory supply
> > and demand for each guest is instead managed entirely dynamically.
>
> I don't know.  I agree that fine-grained memory management is the duty of the
> hypervisor, but it seems to me that the toolstack should be able to handle
> admission control.  It knows how much memory each existing guest is allowed
> to consume at max,
> !!!!!!!!!!!how much memory the new guest requires!!!!!!!!!!
> how much memory
> the system has total...  Isn't the decision then simple?

A fundamental assumption of tmem is that _nobody_ knows how much
memory a guest requires, not even the OS kernel running in the guest.
If you have a toolstack that does know, please submit a paper to
OSDI. ;-)  If you have a toolstack that can do it for thousands of
guests across hundreds of machines, please start up a company and
allow me to invest. ;-)

One way to think of tmem is as a huge co-feedback loop that estimates
memory demand and deals effectively with the consequences of the
(always wrong) estimate, using very fine-grained adjustments AND
mechanisms that allow maximum flexibility between guest memory demands
while minimizing impact on the running guests.

> Tmem should be fairly invisible to the toolstack, right?

It can be invisible, as long as the toolstack doesn't either make the
assumption that it controls every page allocated/freed by the
hypervisor or make the assumption that a large allocation can be
completed atomically.  The first of those assumptions is what is
generating all the controversy (George's worldview) and the second is
the problem I am trying to solve with the "claim" hypercall/subop.
And I'd like to solve it in a way that handles both tmem and non-tmem.

Thanks,
Dan
Dan Magenheimer
2012-Oct-30 17:13 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 30.10.12 at 16:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > With tmem, memory "owned" by a domain (d.tot_pages) increases dynamically
> > in two ways: selfballooning and persistent puts (aka frontswap),
> > but is always capped by d.max_pages.  Neither of these communicates
> > to the toolstack.
> >
> > Similarly, tmem (or selfballooning) may be dynamically freeing up lots
> > of memory without communicating to the toolstack, which could result in
> > the toolstack rejecting a domain launch believing there is insufficient
> > memory.
> >
> > I am thinking the "claim" hypercall/subop eliminates these problems
> > and hope you agree!
>
> With tmem being the odd one here, wouldn't it make more sense
> to force it into no-alloc mode (apparently not exactly the same as
> freezing all pools) for the (infrequent?) time periods of domain
> creation, thus not allowing the amount of free memory to drop
> unexpectedly?  Tmem could, during these time periods, still itself
> internally recycle pages (e.g. fulfill a persistent put by discarding
> an ephemeral page).

Hi Jan --

Freeze has some unattractive issues that "claim" would solve
(see below), and freeze (whether ephemeral pages are used or not)
blocks allocations due to tmem, but doesn't block allocations due
to selfballooning (or manual ballooning attempts by a guest user
with root access).  I suppose the tmem freeze implementation could
be extended to also block all non-domain-creation ballooning
attempts, but I'm not sure if that's what you are proposing.

To digress for a moment first, the original problem exists both in
non-tmem systems AND tmem systems.  It has been seen in the wild on
non-tmem systems.  I am involved with proposing a solution primarily
because, if the solution is designed correctly, it _also_ solves a
tmem problem.  (And as long as we have digressed, I believe it _also_
solves a page-sharing problem on non-tmem systems.)  That said,
here's the unattractive tmem freeze/thaw issue, first with
the existing freeze implementation.

Suppose you have a huge 256GB machine and you have already launched
a 64GB tmem guest "A".  The guest is idle for now, so slowly
selfballoons down to maybe 4GB.  You start to launch another 64GB
guest "B" which, as we know, is going to take some time to complete.
In the middle of launching "B", "A" suddenly gets very active and
needs to balloon up as quickly as possible, but it can't balloon fast
enough (or at all, if "frozen" as suggested), so it starts swapping
(and, thanks to Linux frontswap, the swapping tries to go to
hypervisor/tmem memory).  But ballooning and tmem are both blocked,
and so the guest swaps its poor little butt off even though there's
>100GB of free physical memory available.

Let's add in your suggestion, that a persistent put can be fulfilled
by discarding an ephemeral page.  I see two issues: First, it requires
the number of ephemeral pages available to be larger than the number
of persistent pages required; this may not always be true, though most
of the time it will be true.  Second, the second domain creation
activity may have been assuming that it could use some (or all) of the
freeable pages, which have now been absorbed by the first guest's
persistent puts.

So I think "claim" is still needed anyway.  Comments?

Thanks,
Dan
Jan Beulich
2012-Oct-31 08:14 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> With tmem being the odd one here, wouldn't it make more sense
>> to force it into no-alloc mode (apparently not exactly the same as
>> freezing all pools) for the (infrequent?) time periods of domain
>> creation, thus not allowing the amount of free memory to drop
>> unexpectedly? Tmem could, during these time periods, still itself
>> internally recycle pages (e.g. fulfill a persistent put by discarding
>> an ephemeral page).
>
> Freeze has some unattractive issues that "claim" would solve
> (see below) and freeze (whether ephemeral pages are used or not)
> blocks allocations due to tmem, but doesn't block allocations due
> to selfballooning (or manual ballooning attempts by a guest user
> with root access). I suppose the tmem freeze implementation could
> be extended to also block all non-domain-creation ballooning
> attempts but I'm not sure if that's what you are proposing.
>
> To digress for a moment first, the original problem exists both in
> non-tmem systems AND tmem systems. It has been seen in the wild on
> non-tmem systems. I am involved with proposing a solution primarily
> because, if the solution is designed correctly, it _also_ solves a
> tmem problem. (And as long as we have digressed, I believe it _also_
> solves a page-sharing problem on non-tmem systems.) That said,
> here's the unattractive tmem freeze/thaw issue, first with
> the existing freeze implementation.
>
> Suppose you have a huge 256GB machine and you have already launched
> a 64GB tmem guest "A". The guest is idle for now, so slowly
> selfballoons down to maybe 4GB. You start to launch another 64GB
> guest "B" which, as we know, is going to take some time to complete.
> In the middle of launching "B", "A" suddenly gets very active and
> needs to balloon up as quickly as possible or it can't balloon fast
> enough (or at all if "frozen" as suggested) so starts swapping (and,
> thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> memory). But ballooning and tmem are both blocked and so the
> guest swaps its poor little butt off even though there's >100GB
> of free physical memory available.

That's only one side of the overcommit situation you're striving to get work
right here: That same self ballooning guest, after sufficiently many more
guests got started so that the rest of the memory got absorbed by them,
would suffer the very same problems in the described situation, so it has to
be prepared for this case anyway.

As long as the allocation times can get brought down to an acceptable level,
I continue to not see a need for the extra "claim" approach you're
proposing. So working on that one (or showing that without unreasonable
effort this cannot be further improved) would be a higher priority thing
from my pov (without anyone arguing about its usefulness).

But yes, with all the factors you mention brought in, there is certainly
some improvement needed (whether your "claim" proposal is the right thing is
another question, not to mention that I currently don't see how this would
get implemented in a consistent way taking several orders of magnitude less
time to carry out).

Jan
Dan Magenheimer
2012-Oct-31 16:04 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]

(NOTE TO KEIR: Input from you requested in first stanza below.)

Hi Jan --

Thanks for the continued feedback!

I've slightly re-ordered the email to focus on the problem (moved
tmem-specific discussion to the end).

> As long as the allocation times can get brought down to an
> acceptable level, I continue to not see a need for the extra
> "claim" approach you're proposing. So working on that one (or
> showing that without unreasonable effort this cannot be
> further improved) would be a higher priority thing from my pov
> (without anyone arguing about its usefulness).

Fair enough. I will do some measurement and analysis of this code. However,
let me ask something of you and Keir as well: Please estimate how long (in
usec) you think it is acceptable to hold the heap_lock. If your limit is
very small (as I expect), doing anything "N" times in a loop with the lock
held (for N==2^26, which is a 256GB domain) may make the analysis moot.

> But yes, with all the factors you mention brought in, there is
> certainly some improvement needed (whether your "claim"
> proposal is the right thing is another question, not to mention
> that I currently don't see how this would get implemented in
> a consistent way taking several orders of magnitude less time
> to carry out).

OK, I will start on the next step... proof-of-concept. I'm envisioning
simple arithmetic, but maybe you are right and arithmetic will not be
sufficient.

> > Suppose you have a huge 256GB machine and you have already launched
> > a 64GB tmem guest "A". The guest is idle for now, so slowly
> > selfballoons down to maybe 4GB. You start to launch another 64GB
> > guest "B" which, as we know, is going to take some time to complete.
> > In the middle of launching "B", "A" suddenly gets very active and
> > needs to balloon up as quickly as possible or it can't balloon fast
> > enough (or at all if "frozen" as suggested) so starts swapping (and,
> > thanks to Linux frontswap, the swapping tries to go to hypervisor/tmem
> > memory). But ballooning and tmem are both blocked and so the
> > guest swaps its poor little butt off even though there's >100GB
> > of free physical memory available.
>
> That's only one side of the overcommit situation you're striving
> to get work right here: That same self ballooning guest, after
> sufficiently many more guests got started so that the rest of the
> memory got absorbed by them, would suffer the very same problems in
> the described situation, so it has to be prepared for this case
> anyway.

The tmem design does ensure the guest is prepared for this case anyway...
the guest swaps. And, unlike page-sharing, the guest determines which pages
to swap, not the host, and there is no possibility of double-paging.

In your scenario, the host memory is truly oversubscribed. This scenario is
ultimately a weakness of virtualization in general; trying to
statistically-share an oversubscribed fixed resource among a number of
guests will sometimes cause a performance degradation, whether the resource
is CPU or LAN bandwidth or, in this case, physical memory. That very generic
problem is I think not one any of us can solve. Toolstacks need to be able
to recognize the problem (whether CPU, LAN, or memory) and act accordingly
(report, or auto-migrate).

In my scenario, guest performance is hammered only because of the
unfortunate deficiency in the existing hypervisor memory allocation
mechanisms, namely that small allocations must be artificially "frozen"
until a large allocation can complete. That specific problem is one I am
trying to solve.

BTW, with tmem, some future toolstack might monitor various available tmem
statistics and predict/avoid your scenario.

Dan
Jan Beulich
2012-Oct-31 16:19 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 31.10.12 at 17:04, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>>
>> >>> On 30.10.12 at 18:13, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> >> From: Jan Beulich [mailto:JBeulich@suse.com]
>
> (NOTE TO KEIR: Input from you requested in first stanza below.)
>
> Hi Jan --
>
> Thanks for the continued feedback!
>
> I've slightly re-ordered the email to focus on the problem
> (moved tmem-specific discussion to the end).
>
>> As long as the allocation times can get brought down to an
>> acceptable level, I continue to not see a need for the extra
>> "claim" approach you're proposing. So working on that one (or
>> showing that without unreasonable effort this cannot be
>> further improved) would be a higher priority thing from my pov
>> (without anyone arguing about its usefulness).
>
> Fair enough. I will do some measurement and analysis of this
> code. However, let me ask something of you and Keir as well:
> Please estimate how long (in usec) you think it is acceptable
> to hold the heap_lock. If your limit is very small (as I expect),
> doing anything "N" times in a loop with the lock held (for N==2^26,
> which is a 256GB domain) may make the analysis moot.

I think your thoughts here simply go a different route than mine: Of course
it is wrong to hold _any_ lock for extended periods of time. But extending
what was done by c/s 26056:177fdda0be56 might, considering the effect that
change had, buy you quite a bit of allocation efficiency.

Jan
Dan Magenheimer
2012-Oct-31 16:51 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 31.10.12 at 17:04, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
> >>
> >> As long as the allocation times can get brought down to an
> >> acceptable level, I continue to not see a need for the extra
> >> "claim" approach you're proposing. So working on that one (or
> >> showing that without unreasonable effort this cannot be
> >> further improved) would be a higher priority thing from my pov
> >> (without anyone arguing about its usefulness).
> >
> > Fair enough. I will do some measurement and analysis of this
> > code. However, let me ask something of you and Keir as well:
> > Please estimate how long (in usec) you think it is acceptable
> > to hold the heap_lock. If your limit is very small (as I expect),
> > doing anything "N" times in a loop with the lock held (for N==2^26,
> > which is a 256GB domain) may make the analysis moot.
>
> I think your thoughts here simply go a different route than mine:
> Of course it is wrong to hold _any_ lock for extended periods of
> time. But extending what was done by c/s 26056:177fdda0be56
> might, considering the effect that change had, buy you quite a
> bit of allocation efficiency.

No, I think we are on the same route, except that maybe I am trying to take
a shortcut to the end. :-)

I did follow the discussion that led to that changeset and highly
recommended to the Oracle product folks that we integrate it asap. But
reducing the domain allocation time "massively" from 30 sec to 3 sec doesn't
help solve my issue because, in essence, my issue says that the heap_lock
must still be held for most of that 3 sec. Even reducing it by _another_
factor of 10 to 0.3 sec or a factor of 100 to 30msec doesn't solve my
problem.

To look at it another way, the code in alloc_heap_page() contained within
the loop:

    for ( i = 0; i < (1 << order); i++ )

may be already unacceptable, even _after_ the patch, if order==26 (a
fictional page size just for this illustration) because the heap_lock will
be held for a very very long time. (In fact for order==20, 1GB pages, it
could already be a problem.)

The claim hypercall/subop would allocate _capacity_ only, and then the
actual physical pages are "lazily" allocated from that pre-allocated
capacity.

Anyway, I am still planning on proceeding with some of the
measurement/analysis _and_ proof-of-concept.

Thanks,
Dan
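As a rough sketch of the arithmetic being proposed here -- with illustrative
structure and field names, and a userspace mutex standing in for Xen's
heap_lock; this is not code from the Xen tree -- the claim check itself can
be O(1) under the lock:

    #include <errno.h>
    #include <pthread.h>

    /* Illustrative stand-ins -- not Xen's real structures or field names. */
    struct domain {
        unsigned long tot_phys_pages;     /* pages actually allocated so far      */
        unsigned long tot_claimed_pages;  /* capacity staked out; 0 means "unset" */
    };

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER; /* stand-in for Xen's heap_lock */
    static unsigned long total_unclaimed_pages;  /* free pages not yet covered by any claim */

    /* Stake a claim: constant-time arithmetic under the lock;
     * no physical page is assigned here. */
    int domain_claim_pages(struct domain *d, unsigned long nr_pages)
    {
        int rc = -ENOMEM;

        pthread_mutex_lock(&heap_lock);
        if (nr_pages <= total_unclaimed_pages) {
            total_unclaimed_pages -= nr_pages;
            d->tot_claimed_pages = d->tot_phys_pages + nr_pages;
            rc = 0;
        }
        pthread_mutex_unlock(&heap_lock);
        return rc;
    }

The real bookkeeping work in a proof of concept would be in the allocation
and free paths, which would have to keep total_unclaimed_pages in step as a
claiming domain's pages are actually populated or released.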
Dario Faggioli
2012-Nov-01 02:13 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On Mon, 2012-10-29 at 22:35 +0000, Tim Deegan wrote:
> At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:
> > In the case of a dying domain, a XENMEM_release operation
> > is implied and must be executed by the hypervisor.
> >
> > Ideally, the quantity of unclaimed memory for each domain and
> > for the system should be query-able. This may require additional
> > memory_op hypercalls.
> >
> > I'd very much appreciate feedback on this proposed design!
>
> As I said, I'm not opposed to this, though even after reading through
> the other thread I'm not convinced that it's necessary (except in cases
> where guest-controlled operations are allowed to consume unbounded
> memory, which frankly gives me the heebie-jeebies).
>
Let me also ask something.

Playing with NUMA systems I've been in the situation where it would be nice
to know not only how much free memory we have in general, but how much free
memory there is in a specific (set of) node(s), and that in many places,
from the hypervisor, to libxc, to top level toolstack.

Right now I ask this to Xen, but that is indeed prone to races and TOCTOU
issues if we allow for domain creation and ballooning (tmem/paging/...) to
happen concurrently between themselves and between each other (as noted in
the long thread that preceded this one).

Question is, the "claim" mechanism you're proposing is by no means NUMA
node-aware, right?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
Dan Magenheimer
2012-Nov-01 15:51 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Dario Faggioli [mailto:raistlin@linux.it]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On Mon, 2012-10-29 at 22:35 +0000, Tim Deegan wrote:
> > At 10:06 -0700 on 29 Oct (1351505175), Dan Magenheimer wrote:
> > > In the case of a dying domain, a XENMEM_release operation
> > > is implied and must be executed by the hypervisor.
> > >
> > > Ideally, the quantity of unclaimed memory for each domain and
> > > for the system should be query-able. This may require additional
> > > memory_op hypercalls.
> > >
> > > I'd very much appreciate feedback on this proposed design!
> >
> > As I said, I'm not opposed to this, though even after reading through
> > the other thread I'm not convinced that it's necessary (except in cases
> > where guest-controlled operations are allowed to consume unbounded
> > memory, which frankly gives me the heebie-jeebies).
> >
> Let me also ask something.
>
> Playing with NUMA systems I've been in the situation where it would be
> nice to know not only how much free memory we have in general, but how
> much free memory there is in a specific (set of) node(s), and that in
> many places, from the hypervisor, to libxc, to top level toolstack.
>
> Right now I ask this to Xen, but that is indeed prone to races and
> TOCTOU issues if we allow for domain creation and ballooning

TOCTOU... hadn't seen that term before, but I agree it describes the problem
succinctly. Thanks, I will begin using that now!

> (tmem/paging/...) to happen concurrently between themselves and between
> each other (as noted in the long thread that preceded this one).
>
> Question is, the "claim" mechanism you're proposing is by no means NUMA
> node-aware, right?

I hadn't thought about NUMA, but I think the claim mechanism could be
augmented to attempt to stake a claim on a specified node, or on any node
that has sufficient memory. AFAICT this might complicate the arithmetic a
bit but should work. Let me prototype the NUMA-ignorant mechanism first
though...

Dan
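A NUMA-aware variant could keep the same arithmetic per node and prefer a
specified node before falling back to any node with room. This is only a
sketch under assumed names (the array, the lock and the node bound are all
illustrative), and it sidesteps claims that would have to span nodes:

    #include <pthread.h>

    #define MAX_NODES 64   /* illustrative upper bound */

    static pthread_mutex_t claim_lock = PTHREAD_MUTEX_INITIALIZER;
    static unsigned long node_unclaimed_pages[MAX_NODES];

    /* Returns the node the claim was staked on, or -1 if no single node can hold it. */
    int claim_pages_on_node(unsigned long nr_pages, int preferred_node)
    {
        int node, chosen = -1;

        pthread_mutex_lock(&claim_lock);
        if (preferred_node >= 0 && node_unclaimed_pages[preferred_node] >= nr_pages)
            chosen = preferred_node;
        else
            for (node = 0; node < MAX_NODES; node++)
                if (node_unclaimed_pages[node] >= nr_pages) {
                    chosen = node;
                    break;
                }
        if (chosen >= 0)
            node_unclaimed_pages[chosen] -= nr_pages;  /* stake the claim on that node */
        pthread_mutex_unlock(&claim_lock);
        return chosen;
    }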
Jan Beulich
2012-Nov-02 09:01 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 31.10.12 at 17:51, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> To look at it another way, the code in alloc_heap_page()
> contained within the loop:
>
> for ( i = 0; i < (1 << order); i++ )
>
> may be already unacceptable, even _after_ the patch, if
> order==26 (a fictional page size just for this illustration)
> because the heap_lock will be held for a very very long time.
> (In fact for order==20, 1GB pages, it could already be a
> problem.)

A million iterations doing just a few memory reads and writes (not even
atomic ones afaics) doesn't sound that bad. And order-18 allocations (which
is what 1Gb pages really amount to) are the biggest ever happening
(post-boot, if that matters).

You'll get much worse behavior if these large order allocations fail, and
the callers have to fall back to smaller ones.

Plus, if necessary, that loop could be broken up so that only the initial
part of it gets run with the lock held (see c/s 22135:69e8bb164683 for why
the unlock was moved past the loop). That would make for a shorter lock hold
time, but for a higher allocation latency on large order allocations (due to
worse cache locality).

Jan
Keir Fraser
2012-Nov-02 09:30 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:

> Plus, if necessary, that loop could be broken up so that only the
> initial part of it gets run with the lock held (see c/s
> 22135:69e8bb164683 for why the unlock was moved past the
> loop). That would make for a shorter lock hold time, but for a
> higher allocation latency on large order allocations (due to worse
> cache locality).

In fact I believe only the first page needs to have its count_info set to
!PGC_state_free, while the lock is held. That is sufficient to defeat the
buddy merging in free_heap_pages(). Similarly, we could hoist most of the
first loop in free_heap_pages() outside the lock. There's a lot of scope for
optimisation here.

 -- Keir
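The lock-scope reduction being suggested can be illustrated generically: do
only the bookkeeping that must be atomic (removing the chunk and marking its
head page so the merge in the free path cannot reclaim it) under the lock,
and run the O(2^order) per-page loop afterwards. This is a pattern sketch
with toy types and a toy heap, not the real alloc_heap_pages() code:

    #include <pthread.h>
    #include <stddef.h>

    struct page_info { unsigned long count_info; };
    #define PGC_STATE_FREE 1UL                /* toy flag, not Xen's PGC_state_free */

    static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct page_info pool[1UL << 10];  /* toy heap: one order-10 chunk */
    static int pool_taken;

    /* Toy stand-in for removing a 2^order chunk from the buddy free lists. */
    static struct page_info *take_chunk_locked(unsigned int order)
    {
        if (pool_taken || order > 10)
            return NULL;
        pool_taken = 1;
        return pool;
    }

    struct page_info *alloc_chunk(unsigned int order)
    {
        struct page_info *pg;
        unsigned long i;

        pthread_mutex_lock(&heap_lock);
        pg = take_chunk_locked(order);
        if (pg != NULL)
            pg[0].count_info &= ~PGC_STATE_FREE;  /* head page only: defeats merging */
        pthread_mutex_unlock(&heap_lock);

        if (pg == NULL)
            return NULL;

        /* The long per-page loop now runs without the heap lock held. */
        for (i = 1; i < (1UL << order); i++)
            pg[i].count_info &= ~PGC_STATE_FREE;

        return pg;
    }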
Dan Magenheimer
2012-Nov-04 19:43 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Keir Fraser [mailto:keir@xen.org]
> Sent: Friday, November 02, 2012 3:30 AM
> To: Jan Beulich; Dan Magenheimer
> Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
>
> > Plus, if necessary, that loop could be broken up so that only the
> > initial part of it gets run with the lock held (see c/s
> > 22135:69e8bb164683 for why the unlock was moved past the
> > loop). That would make for a shorter lock hold time, but for a
> > higher allocation latency on large order allocations (due to worse
> > cache locality).
>
> In fact I believe only the first page needs to have its count_info set to
> !PGC_state_free, while the lock is held. That is sufficient to defeat the
> buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> first loop in free_heap_pages() outside the lock. There's a lot of scope for
> optimisation here.

(sorry for the delayed response)

Aren't we getting a little sidetracked here? (Maybe my fault for looking at
whether this specific loop is fast enough...)

This loop handles only order=N chunks of RAM. Speeding up this loop and
holding the heap_lock here for a shorter period only helps the TOCTOU race
if the entire domain can be allocated as a single order-N allocation.

Domain creation is supposed to succeed as long as there is sufficient RAM,
_regardless_ of the state of memory fragmentation, correct?

So unless the code for the _entire_ memory allocation path can be optimized
so that the heap_lock can be held across _all_ the allocations necessary to
create an arbitrary-sized domain, for any arbitrary state of memory
fragmentation, the original problem has not been solved.

Or am I misunderstanding?

I _think_ the claim hypercall/subop should resolve this, though admittedly I
have yet to prove (and code) it.

Thanks,
Dan
Tim Deegan
2012-Nov-04 20:35 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
At 11:43 -0800 on 04 Nov (1352029386), Dan Magenheimer wrote:
> > From: Keir Fraser [mailto:keir@xen.org]
> > Sent: Friday, November 02, 2012 3:30 AM
> > To: Jan Beulich; Dan Magenheimer
> > Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> >
> > On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
> >
> > > Plus, if necessary, that loop could be broken up so that only the
> > > initial part of it gets run with the lock held (see c/s
> > > 22135:69e8bb164683 for why the unlock was moved past the
> > > loop). That would make for a shorter lock hold time, but for a
> > > higher allocation latency on large order allocations (due to worse
> > > cache locality).
> >
> > In fact I believe only the first page needs to have its count_info set to
> > !PGC_state_free, while the lock is held. That is sufficient to defeat the
> > buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> > first loop in free_heap_pages() outside the lock. There's a lot of scope for
> > optimisation here.
>
> (sorry for the delayed response)
>
> Aren't we getting a little sidetracked here? (Maybe my fault for
> looking at whether this specific loop is fast enough...)
>
> This loop handles only order=N chunks of RAM. Speeding up this
> loop and holding the heap_lock here for a shorter period only helps
> the TOCTOU race if the entire domain can be allocated as a
> single order-N allocation.

I think the idea is to speed up allocation so that, even for a large VM, you
can just allocate memory instead of needing a reservation hypercall (whose
only purpose, AIUI, is to give you an immediate answer).

> So unless the code for the _entire_ memory allocation path can
> be optimized so that the heap_lock can be held across _all_ the
> allocations necessary to create an arbitrary-sized domain, for
> any arbitrary state of memory fragmentation, the original
> problem has not been solved.
>
> Or am I misunderstanding?
>
> I _think_ the claim hypercall/subop should resolve this, though
> admittedly I have yet to prove (and code) it.

I don't think it solves it - or rather it might solve this _particular_
instance of it but it doesn't solve the bigger problem. If you have a set of
overcommitted hosts and you want to start a new VM, you need to:

 - (a) decide which of your hosts is the least overcommitted;
 - (b) free up enough memory on that host to build the VM; and
 - (c) build the VM.

The claim hypercall _might_ fix (c) (if it could handle allocations that
need address-width limits or contiguous pages). But (b) and (a) have exactly
the same problem, unless there is a central arbiter of memory allocation (or
equivalent distributed system). If you try to start 2 VMs at once,

 - (a) the toolstack will choose to start them both on the same machine,
   even if that's not optimal, or in the case where one creation is
   _bound_ to fail after some delay.
 - (b) the other VMs (and perhaps tmem) start ballooning out enough
   memory to start the new VM. This can take even longer than
   allocating it since it depends on guest behaviour. It can fail
   after an arbitrary delay (ditto).

If you have a toolstack with enough knowledge and control over memory
allocation to sort out stages (a) and (b) in such a way that there are no
delayed failures, (c) should be trivial.

Tim.
Dan Magenheimer
2012-Nov-05 00:23 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Hi Tim --

> At 11:43 -0800 on 04 Nov (1352029386), Dan Magenheimer wrote:
> > > From: Keir Fraser [mailto:keir@xen.org]
> > > Sent: Friday, November 02, 2012 3:30 AM
> > > To: Jan Beulich; Dan Magenheimer
> > > Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
> > > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > >
> > > On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
> > >
> > > > Plus, if necessary, that loop could be broken up so that only the
> > > > initial part of it gets run with the lock held (see c/s
> > > > 22135:69e8bb164683 for why the unlock was moved past the
> > > > loop). That would make for a shorter lock hold time, but for a
> > > > higher allocation latency on large order allocations (due to worse
> > > > cache locality).
> > >
> > > In fact I believe only the first page needs to have its count_info set to
> > > !PGC_state_free, while the lock is held. That is sufficient to defeat the
> > > buddy merging in free_heap_pages(). Similarly, we could hoist most of the
> > > first loop in free_heap_pages() outside the lock. There's a lot of scope for
> > > optimisation here.
> >
> > (sorry for the delayed response)
> >
> > Aren't we getting a little sidetracked here? (Maybe my fault for
> > looking at whether this specific loop is fast enough...)
> >
> > This loop handles only order=N chunks of RAM. Speeding up this
> > loop and holding the heap_lock here for a shorter period only helps
> > the TOCTOU race if the entire domain can be allocated as a
> > single order-N allocation.
>
> I think the idea is to speed up allocation so that, even for a large VM,
> you can just allocate memory instead of needing a reservation hypercall
> (whose only purpose, AIUI, is to give you an immediate answer).

Its purpose is to give an immediate answer on whether sufficient space is
available for allocation AND (atomically) claim it so no other call to the
allocator can race and steal some or all of it away. So unless the
allocation is sped up enough (given an arbitrary size domain and arbitrary
state of memory fragmentation) so that the heap_lock can be held for that
length of time, speeding up allocation doesn't solve the problem.

> > So unless the code for the _entire_ memory allocation path can
> > be optimized so that the heap_lock can be held across _all_ the
> > allocations necessary to create an arbitrary-sized domain, for
> > any arbitrary state of memory fragmentation, the original
> > problem has not been solved.
> >
> > Or am I misunderstanding?
> >
> > I _think_ the claim hypercall/subop should resolve this, though
> > admittedly I have yet to prove (and code) it.
>
> I don't think it solves it - or rather it might solve this _particular_
> instance of it but it doesn't solve the bigger problem. If you have a
> set of overcommitted hosts and you want to start a new VM, you need to:
>
> - (a) decide which of your hosts is the least overcommitted;
> - (b) free up enough memory on that host to build the VM; and
> - (c) build the VM.
>
> The claim hypercall _might_ fix (c) (if it could handle allocations that
> need address-width limits or contiguous pages). But (b) and (a) have
> exactly the same problem, unless there is a central arbiter of memory
> allocation (or equivalent distributed system). If you try to start 2
> VMs at once,
>
> - (a) the toolstack will choose to start them both on the same machine,
>   even if that's not optimal, or in the case where one creation is
>   _bound_ to fail after some delay.
> - (b) the other VMs (and perhaps tmem) start ballooning out enough
>   memory to start the new VM. This can take even longer than
>   allocating it since it depends on guest behaviour. It can fail
>   after an arbitrary delay (ditto).
>
> If you have a toolstack with enough knowledge and control over memory
> allocation to sort out stages (a) and (b) in such a way that there are
> no delayed failures, (c) should be trivial.

(You've used the labels (a) and (b) twice so I'm not quite sure I
understand... but in any case)

Sigh. No, you are missing the beauty of tmem and dynamic allocation; you are
thinking from the old static paradigm where the toolstack controls how much
memory is available. There is no central arbiter of memory any more than
there is a central toolstack (other than the hypervisor on a one server Xen
environment) that decides exactly when to assign vcpus to pcpus.

There is no "free up enough memory on that host". Tmem doesn't start
ballooning out enough memory to start the VM... the guests are responsible
for doing the ballooning and it is _already done_. The machine either has
sufficient free+freeable memory or it does not; and it is _that_
determination that needs to be done atomically because many threads are
micro-allocating, and possibly multiple toolstack threads are
macro-allocating, simultaneously.

Everything is handled dynamically. And just like a CPU scheduler built into
a hypervisor that dynamically allocates vcpu->pcpus has proven more
effective than partitioning pcpus to different domains, dynamic memory
management should prove more effective than some bossy toolstack trying to
control memory statically.

I understand that you can solve "my" problem in your paradigm without a
claim hypercall and/or by speeding up allocations. I _don't_ see that you
can solve "my" problem in _my_ paradigm without a claim hypercall...
speeding up allocations doesn't solve the TOCTOU race so allocating
sufficient space for a domain must be atomic.

Sigh.
Dan
Jan Beulich
2012-Nov-05 09:16 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 04.11.12 at 20:43, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Keir Fraser [mailto:keir@xen.org]
>> Sent: Friday, November 02, 2012 3:30 AM
>> To: Jan Beulich; Dan Magenheimer
>> Cc: Olaf Hering; IanCampbell; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; TimDeegan
>> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>>
>> On 02/11/2012 09:01, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>> > Plus, if necessary, that loop could be broken up so that only the
>> > initial part of it gets run with the lock held (see c/s
>> > 22135:69e8bb164683 for why the unlock was moved past the
>> > loop). That would make for a shorter lock hold time, but for a
>> > higher allocation latency on large order allocations (due to worse
>> > cache locality).
>>
>> In fact I believe only the first page needs to have its count_info set to
>> !PGC_state_free, while the lock is held. That is sufficient to defeat the
>> buddy merging in free_heap_pages(). Similarly, we could hoist most of the
>> first loop in free_heap_pages() outside the lock. There's a lot of scope for
>> optimisation here.
>
> (sorry for the delayed response)
>
> Aren't we getting a little sidetracked here? (Maybe my fault for
> looking at whether this specific loop is fast enough...)
>
> This loop handles only order=N chunks of RAM. Speeding up this
> loop and holding the heap_lock here for a shorter period only helps
> the TOCTOU race if the entire domain can be allocated as a
> single order-N allocation.
>
> Domain creation is supposed to succeed as long as there is
> sufficient RAM, _regardless_ of the state of memory fragmentation,
> correct?
>
> So unless the code for the _entire_ memory allocation path can
> be optimized so that the heap_lock can be held across _all_ the
> allocations necessary to create an arbitrary-sized domain, for
> any arbitrary state of memory fragmentation, the original
> problem has not been solved.
>
> Or am I misunderstanding?

I think we got here via questioning whether suppressing certain activities
(like tmem causing the allocator-visible amount of available memory to
change) for a brief period of time would be acceptable, and while that
indeed depends on the overall latency of memory allocation for the domain as
a whole, I would be somewhat tolerant for it to involve a longer suspension
period on a highly fragmented system.

But of course, if this can be made to work uniformly, that would be
preferred.

Jan
Ian Campbell
2012-Nov-05 10:29 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> There is no "free up enough memory on that host". Tmem doesn't start
> ballooning out enough memory to start the VM... the guests are
> responsible for doing the ballooning and it is _already done_. The
> machine either has sufficient free+freeable memory or it does not;

How does one go about deciding which host in a multi thousand host
deployment to try the claim hypercall on?

Ian
Dan Magenheimer
2012-Nov-05 14:54 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Sent: Monday, November 05, 2012 3:30 AM
> To: Dan Magenheimer
> Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > There is no "free up enough memory on that host". Tmem doesn't start
> > ballooning out enough memory to start the VM... the guests are
> > responsible for doing the ballooning and it is _already done_. The
> > machine either has sufficient free+freeable memory or it does not;
>
> How does one go about deciding which host in a multi thousand host
> deployment to try the claim hypercall on?

I don't get paid enough to solve that problem :-)

VM placement (both for new domains and migration due to load-balancing and
power-management) is dependent on a number of factors currently involving
CPU utilization, SAN utilization, and LAN utilization, I think using
historical trends on streams of sampled statistics. This is very
non-deterministic as all of these factors may vary dramatically within a
sampling interval.

Adding free+freeable memory to this just adds one more such statistic.
Actually two, as it is probably best to track free separately from freeable
since a candidate host that has enough free memory should have preference
over one with freeable memory.

Sorry if that's not very satisfying but anything beyond that meager
description is outside of my area of expertise.

Dan

P.S. I don't think I've ever said _thousands_ of physical hosts, just
hundreds (with thousands of VMs). Honestly I don't know the upper support
bound for an Oracle VM "server pool" (which is what we call the collection
of hundreds of physical machines)... it may be thousands.
George Dunlap
2012-Nov-05 17:14 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 30/10/12 15:43, Dan Magenheimer wrote:
> a) Truly free memory (each free page is on the hypervisor free list)
> b) Freeable memory ("ephemeral" memory managed by tmem)
> c) Owned memory (pages allocated by the hypervisor or for a domain)
>
> The sum of these three is always a constant: The total number of
> RAM pages in the system. However, when tmem is active, the values
> of all _three_ of these change constantly. So if at the start of a
> domain launch, the sum of free+freeable exceeds the intended size
> of the domain, the domain allocation/launch can start.

Why free+freeable, rather than just "free"?

> But then
> if "owned" increases enough, there may no longer be enough memory
> and the domain launch will fail.

Again, "owned" would not increase at all if the guest weren't handing memory
back to Xen. Why is that necessary, or even helpful?

(And please don't start another rant about the bold new world of peace and
love. Give me a freaking *technical* answer.)

 -George
Dan Magenheimer
2012-Nov-05 18:21 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> On 30/10/12 15:43, Dan Magenheimer wrote:
> > a) Truly free memory (each free page is on the hypervisor free list)
> > b) Freeable memory ("ephemeral" memory managed by tmem)
> > c) Owned memory (pages allocated by the hypervisor or for a domain)
> >
> > The sum of these three is always a constant: The total number of
> > RAM pages in the system. However, when tmem is active, the values
> > of all _three_ of these change constantly. So if at the start of a
> > domain launch, the sum of free+freeable exceeds the intended size
> > of the domain, the domain allocation/launch can start.

> (And please don't start another rant about the bold new world of peace
> and love. Give me a freaking *technical* answer.)

<grin> /Me removes seventies-style tie-dye tshirt with peace logo and sadly
withdraws single daisy previously extended to George.

> Why free+freeable, rather than just "free"?

A free page is a page that is not used for anything at all. It is on the
hypervisor's free list.

A freeable page contains tmem ephemeral data stored on behalf of a domain
(or, if dedup'ing is enabled, on behalf of one or more domains). More
specifically for a tmem-enabled Linux guest, a freeable page contains a
clean page cache page that the Linux guest OS has asked the hypervisor (via
the tmem ABI) to hold if it can for as long as it can. The specific clean
page cache pages are chosen and the call is done on the Linux side via
"cleancache".

So, when tmem is working optimally, there are few or no free pages and many
many freeable pages (perhaps half of physical RAM or more). Freeable pages
across all tmem-enabled guests are kept in a single LRU queue. When a
request is made to the hypervisor allocator for a free page and its free
list is empty, the allocator will force tmem to relinquish an ephemeral page
(in LRU order). Because this is entirely up to the hypervisor and can happen
at any time, freeable pages are not counted as "owned" by a domain but still
have some value to a domain.

So, in essence, a "free" page has zero value and a "freeable" page has a
small, but non-zero value that decays over time. So it's useful for a
toolstack to know both quantities.

(And, since this thread has gone in many directions, let me reiterate that
all of this has been working in the hypervisor since 4.0 in 2009, and
cleancache in Linux since mid-2011.)

> > But then
> > if "owned" increases enough, there may no longer be enough memory
> > and the domain launch will fail.
>
> Again, "owned" would not increase at all if the guest weren't handing
> memory back to Xen. Why is that necessary, or even helpful?

The guest _is_ handing memory back to Xen. This is the other half of the
tmem functionality, persistent pages. Answering your second question is
going to require a little more background.

Since nobody, not even the guest kernel, can guess the future needs of its
workload, there are two choices: (1) allocate enough RAM so that the supply
always exceeds max-demand, or (2) aggressively reduce RAM to a reasonable
guess for a target and prepare for the probability that, sometimes,
available RAM won't be enough. Tmem does choice #2; self-ballooning
aggressively drives RAM (or "current memory" as the hypervisor sees it) to a
target level: in Linux, to Committed_AS modified by a formula similar to the
one Novell derived for a minimum ballooning safety level.

The target level changes constantly, but the selfballooning code samples and
adjusts only periodically. If, during the time interval between samples,
memory demand spikes, Linux has a memory shortage and responds as it must,
namely by swapping. The frontswap code in Linux "intercepts" this swapping
so that, in most cases, it goes to a Xen tmem persistent pool instead of to
a (virtual or physical) swap disk. Data in persistent pools, unlike
ephemeral pools, are guaranteed to be maintained by the hypervisor until the
guest invalidates it or until the guest dies. As a result, pages allocated
for persistent pools increase the count of pages "owned" by the domain that
requested the pages, until the guest explicitly invalidates them (or dies).
The accounting also ensures that malicious domains can't absorb memory
beyond the toolset-specified limit ("maxmem"). Note that, if compression is
enabled, a domain _may_ "logically" exceed maxmem, as long as it does not
physically exceed it.

(And, again, all of this too has been in Xen since 4.0 in 2009, and
selfballooning has been in Linux since mid-2011, but frontswap finally was
accepted into Linux earlier in 2012.)

Ok, George, does that answer your questions, _technically_? I'll be happy to
answer any others.

Thanks,
Dan
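The free/freeable distinction described above boils down to a fallback in
the allocator: if the free list is empty, the oldest tmem ephemeral page is
relinquished and its frame handed out instead. A toy single-frame sketch of
that fallback (not the real Xen/tmem list handling; the types and helpers
are illustrative only):

    #include <stddef.h>

    struct frame { struct frame *next; };

    static struct frame *free_list;           /* truly free frames                   */
    static struct frame *ephemeral_lru_tail;  /* oldest "freeable" tmem frame (toy)  */

    /* Toy stand-in for tmem relinquishing its LRU ephemeral page. */
    static struct frame *tmem_evict_oldest(void)
    {
        struct frame *f = ephemeral_lru_tail;

        if (f != NULL)
            ephemeral_lru_tail = NULL;        /* single-entry "LRU" for illustration */
        return f;
    }

    /* Allocate one frame: prefer a truly free one, otherwise squeeze tmem. */
    struct frame *alloc_frame(void)
    {
        struct frame *f = free_list;

        if (f != NULL) {
            free_list = f->next;              /* zero-value page: just take it       */
            return f;
        }
        return tmem_evict_oldest();           /* freeable page: evict it, then reuse */
    }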
Ian Campbell
2012-Nov-05 22:24 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
> > From: Ian Campbell [mailto:ian.campbell@citrix.com]
> > Sent: Monday, November 05, 2012 3:30 AM
> > To: Dan Magenheimer
> > Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang
> > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> >
> > On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > > There is no "free up enough memory on that host". Tmem doesn't start
> > > ballooning out enough memory to start the VM... the guests are
> > > responsible for doing the ballooning and it is _already done_. The
> > > machine either has sufficient free+freeable memory or it does not;
> >
> > How does one go about deciding which host in a multi thousand host
> > deployment to try the claim hypercall on?
>
> I don't get paid enough to solve that problem :-)
>
> VM placement (both for new domains and migration due to
> load-balancing and power-management) is dependent on a
> number of factors currently involving CPU utilization,
> SAN utilization, and LAN utilization, I think using
> historical trends on streams of sampled statistics. This
> is very non-deterministic as all of these factors may
> vary dramatically within a sampling interval.
>
> Adding free+freeable memory to this just adds one more
> such statistic. Actually two, as it is probably best to
> track free separately from freeable since a candidate
> host that has enough free memory should have preference
> over one with freeable memory.
>
> Sorry if that's not very satisfying but anything beyond that
> meager description is outside of my area of expertise.

I guess I don't see how your proposed claim hypercall is useful if you can't
decide which machine you should call it on, whether it's 10s, 100s or 1000s
of hosts. Surely you aren't suggesting that the toolstack try it on all (or
even a subset) of them and see which sticks?

By ignoring this part of the problem I think you are ignoring one of the
most important bits of the story, without which it is very hard to make a
useful and informed determination about the validity of the use cases you
are describing for the new call.

Ian.
Dan Magenheimer
2012-Nov-05 22:33 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Oops, missed an important part of your response... I'm glad I went back and
reread it...

> The claim hypercall _might_ fix (c) (if it could handle allocations that
> need address-width limits or contiguous pages).

I'm still looking into this part. It's my understanding (from Jan) that,
post-dom0-launch, there are no known memory allocation paths that _require_
order>0 allocations. All of them attempt a larger allocation and gracefully
fall back to (eventually) order==0 allocations. I've hacked some code into
the allocator to confirm this, though I'm not sure how to test the
hypothesis exhaustively.

For address-width limits, I suspect we are talking mostly or entirely about
DMA in 32-bit PV domains? And/or PCI-passthrough? I'll look into it further,
but if those are the principal cases, I'd have no problem documenting that
the claim hypercall doesn't handle them and attempts to build such a domain
might still fail slowly. At least unless/until someone decided to add any
necessary special corner cases to the claim hypercall.
Zhigang Wang
2012-Nov-05 22:58 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 11/05/2012 05:24 PM, Ian Campbell wrote:
> On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
>>> On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
>>>> There is no "free up enough memory on that host". Tmem doesn't start
>>>> ballooning out enough memory to start the VM... the guests are
>>>> responsible for doing the ballooning and it is _already done_. The
>>>> machine either has sufficient free+freeable memory or it does not;
>>> How does one go about deciding which host in a multi thousand host
>>> deployment to try the claim hypercall on?
> I guess I don't see how your proposed claim hypercall is useful if you
> can't decide which machine you should call it on, whether it's 10s, 100s
> or 1000s of hosts. Surely you aren't suggesting that the toolstack try
> it on all (or even a subset) of them and see which sticks?
>
> By ignoring this part of the problem I think you are ignoring one of the
> most important bits of the story, without which it is very hard to make
> a useful and informed determination about the validity of the use cases
> you are describing for the new call.

Planned implementation:

1. Every Server (dom0) sends memory statistics to the Manager every 20
   seconds (tunable).
2. When starting a VM, the Manager selects a Server to run it based on the
   snapshot of Server memory. The selected Server should have either enough
   free memory for the VM, or free + freeable memory > VM memory.

Two ways to handle failures:

1. Try start_vm on the first selected Server. If it fails, try the second
   one.
2. Try to reserve memory on the first Server. If that fails, try the second
   one. If it succeeds, start_vm on that Server.

From a high level, Dan's proposal could help with 2). If memory allocation
is fast enough (VM start fails/succeeds very fast), then 1) is preferred.

Thanks,

Zhigang
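A minimal manager-side sketch of the second strategy above (reserve, then
start): rank candidates from the stale snapshot, preferring hosts with
enough truly free memory before ones that also need freeable memory, and
let the immediate answer from the claim/reserve call arbitrate. The types
and RPC helpers here (host_stats, xen_claim_memory, start_vm) are
hypothetical placeholders, not an existing toolstack API:

    #include <stdbool.h>

    struct host_stats {
        int           id;
        unsigned long free_pages;      /* truly free, from the last snapshot     */
        unsigned long freeable_pages;  /* tmem ephemeral, from the last snapshot */
    };

    bool xen_claim_memory(int host_id, unsigned long nr_pages);  /* assumed RPC */
    bool start_vm(int host_id);                                  /* assumed RPC */

    int place_vm(struct host_stats *hosts, int nr_hosts, unsigned long vm_pages)
    {
        int pass, i;

        /* Pass 0: hosts whose free memory alone fits the VM.
         * Pass 1: also count freeable (tmem ephemeral) memory. */
        for (pass = 0; pass < 2; pass++) {
            for (i = 0; i < nr_hosts; i++) {
                struct host_stats *h = &hosts[i];
                unsigned long avail = h->free_pages + (pass ? h->freeable_pages : 0);

                if (avail < vm_pages)
                    continue;
                /* The claim answers immediately, so a stale snapshot only costs
                 * one quick refusal before moving on to the next candidate. */
                if (xen_claim_memory(h->id, vm_pages) && start_vm(h->id))
                    return h->id;
            }
        }
        return -1;  /* no candidate can hold the VM right now */
    }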
Dan Magenheimer
2012-Nov-05 22:58 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Ian Campbell [mailto:ian.campbell@citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature

Hi Ian --

> On Mon, 2012-11-05 at 14:54 +0000, Dan Magenheimer wrote:
> > > From: Ian Campbell [mailto:ian.campbell@citrix.com]
> > > Sent: Monday, November 05, 2012 3:30 AM
> > > To: Dan Magenheimer
> > > Cc: Tim (Xen.org); Keir (Xen.org); Jan Beulich; Olaf Hering; George Dunlap; Ian Jackson; George Shuklin; DarioFaggioli; xen-devel@lists.xen.org; Konrad Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang
> > > Subject: Re: Proposed new "memory capacity claim" hypercall/feature
> > >
> > > On Mon, 2012-11-05 at 00:23 +0000, Dan Magenheimer wrote:
> > > > There is no "free up enough memory on that host". Tmem doesn't start
> > > > ballooning out enough memory to start the VM... the guests are
> > > > responsible for doing the ballooning and it is _already done_. The
> > > > machine either has sufficient free+freeable memory or it does not;
> > >
> > > How does one go about deciding which host in a multi thousand host
> > > deployment to try the claim hypercall on?
> >
> > I don't get paid enough to solve that problem :-)
> >
> > VM placement (both for new domains and migration due to
> > load-balancing and power-management) is dependent on a
> > number of factors currently involving CPU utilization,
> > SAN utilization, and LAN utilization, I think using
> > historical trends on streams of sampled statistics. This
> > is very non-deterministic as all of these factors may
> > vary dramatically within a sampling interval.
> >
> > Adding free+freeable memory to this just adds one more
> > such statistic. Actually two, as it is probably best to
> > track free separately from freeable since a candidate
> > host that has enough free memory should have preference
> > over one with freeable memory.
> >
> > Sorry if that's not very satisfying but anything beyond that
> > meager description is outside of my area of expertise.
>
> I guess I don't see how your proposed claim hypercall is useful if you
> can't decide which machine you should call it on, whether it's 10s, 100s
> or 1000s of hosts. Surely you aren't suggesting that the toolstack try
> it on all (or even a subset) of them and see which sticks?
>
> By ignoring this part of the problem I think you are ignoring one of the
> most important bits of the story, without which it is very hard to make
> a useful and informed determination about the validity of the use cases
> you are describing for the new call.

I'm not ignoring it at all. One only needs to choose a machine and be
prepared that the machine will (immediately) answer "sorry, won't fit". It's
not necessary to choose the _optimal_ fit, only a probable one. Since
failure is immediate, trying more than one machine (which should happen only
rarely) is not particularly problematic, though I completely agree that
trying _all_ of them might be.

The existing OracleVM Manager already chooses domain launch candidates and
load balancing candidates based on sampled CPU/SAN/LAN data, which is always
stale but still sufficient as a rough estimate of the best machine to
choose. Beyond that, I'm not particularly knowledgeable about the details
and, even if I were, I'm not sure if the details are suitable for a public
forum. But I can tell you that it has been shipping for over a year and
here's some of what's published... look for DRS and DPM.

http://www.oracle.com/us/technologies/virtualization/ovm3-whats-new-459313.pdf

Dan
Jan Beulich
2012-Nov-06 10:49 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 05.11.12 at 23:33, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> For address-width limits, I suspect we are talking mostly or
> entirely about DMA in 32-bit PV domains? And/or PCI-passthrough?
> I'll look into it further, but if those are the principal cases,
> I'd have no problem documenting that the claim hypercall doesn't
> handle them and attempts to build such a domain might still
> fail slowly. At least unless/until someone decided to add
> any necessary special corner cases to the claim hypercall.

DMA (also for 64-bit PV) is one aspect, and the fundamental address
restriction of 32-bit guests is perhaps the more important one (for they
can't access the full M2P map, and hence can't ever be handed pages not
covered by the portion they have access to).

Jan
Ian Campbell
2012-Nov-06 13:23 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On Mon, 2012-11-05 at 22:58 +0000, Dan Magenheimer wrote:
> It's not necessary to choose the _optimal_ fit, only a probable one.

I think this is the key point which I was missing i.e. that it doesn't need
to be a totally accurate answer. Without that piece it seemed to me that you
must already have the more knowledgeable toolstack part which others have
mentioned.

Ian.
Dan Magenheimer
2012-Nov-07 22:17 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> > Aren't we getting a little sidetracked here? (Maybe my fault for
> > looking at whether this specific loop is fast enough...)
> >
> > This loop handles only order=N chunks of RAM. Speeding up this
> > loop and holding the heap_lock here for a shorter period only helps
> > the TOCTOU race if the entire domain can be allocated as a
> > single order-N allocation.
> >
> > Domain creation is supposed to succeed as long as there is
> > sufficient RAM, _regardless_ of the state of memory fragmentation,
> > correct?
> >
> > So unless the code for the _entire_ memory allocation path can
> > be optimized so that the heap_lock can be held across _all_ the
> > allocations necessary to create an arbitrary-sized domain, for
> > any arbitrary state of memory fragmentation, the original
> > problem has not been solved.
> >
> > Or am I misunderstanding?
>
> I think we got here via questioning whether suppressing certain
> activities (like tmem causing the allocator-visible amount of
> available memory to change) for a brief period of time would be
> acceptable, and while that indeed depends on the overall latency
> of memory allocation for the domain as a whole, I would be somewhat
> tolerant for it to involve a longer suspension period on a highly
> fragmented system.
>
> But of course, if this can be made to work uniformly, that would be
> preferred.

Hi Jan and Keir --

OK, here's a status update. Sorry for the delay but it took awhile for me to
refamiliarize myself with the code paths.

It appears that the attempt to use 2MB and 1GB pages is done in the
toolstack, and if the hypervisor rejects it, toolstack tries smaller pages.
Thus, if physical memory is highly fragmented (few or no order>=9
allocations available), this will result in one hypercall per 4k page so a
256GB domain would require 64 million hypercalls. And, since AFAICT, there
is no sane way to hold the heap_lock across even two hypercalls, speeding up
the in-hypervisor allocation path, by itself, will not solve the TOCTOU
race.

One option to avoid the 64M hypercalls is to change the Xen ABI to add a new
memory hypercall/subop to populate_physmap an arbitrary amount of physical
RAM, and have Xen (optionally) try order==18, then order==9, then order==0.

I suspect that, even with the overhead of hypercalls removed, the steps
required to allocate 64 million pages (including, for example, removing a
page from a xen list and adding it to the domain's page list) will consume
enough time that holding the heap_lock and/or suppressing micro-allocations
for the entire macro-allocation on a fragmented system will still be
unacceptable (e.g. at least tens of seconds). However, I am speculating, and
I think I can measure it if you (Jan or Keir) feel a measurement is
necessary to fully convince you.

I think this brings us back to the proposed "claim" hypercall/subop. Unless
there are further objections or suggestions for different approaches, I'll
commence prototyping it, OK?

Thanks,
Dan
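The single-call population floated above could look roughly like this:
inside one operation, try 1GB chunks (order 18), fall back to 2MB (order 9),
then single pages (order 0). This is a proposal sketch with made-up helpers
(alloc_chunk, assign_to_domain), not the existing populate_physmap code, and
it ignores preemption and error unwinding:

    #include <stddef.h>

    struct page;                                                  /* opaque here            */
    struct page *alloc_chunk(unsigned int order);                 /* assumed allocator call */
    void assign_to_domain(struct page *pg, unsigned int order);   /* assumed                */

    int populate_domain(unsigned long nr_pages)
    {
        static const unsigned int orders[] = { 18, 9, 0 };
        unsigned int oi = 0;

        while (nr_pages > 0) {
            unsigned int order = orders[oi];
            struct page *pg = NULL;

            /* Only try the current order if it still fits in what is left. */
            if (nr_pages >= (1UL << order))
                pg = alloc_chunk(order);

            if (pg == NULL) {
                if (order == 0)
                    return -1;      /* truly out of memory */
                oi++;               /* fragmented, or remainder too small: drop down */
                continue;
            }
            assign_to_domain(pg, order);
            nr_pages -= 1UL << order;
        }
        return 0;
    }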
Keir Fraser
2012-Nov-08 07:36 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> I think this brings us back to the proposed "claim" hypercall/subop.
> Unless there are further objections or suggestions for different
> approaches, I'll commence prototyping it, OK?

Yes, in fact I thought you'd started already!

 K.
Jan Beulich
2012-Nov-08 08:00 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> It appears that the attempt to use 2MB and 1GB pages is done in
> the toolstack, and if the hypervisor rejects it, toolstack tries
> smaller pages. Thus, if physical memory is highly fragmented
> (few or no order>=9 allocations available), this will result
> in one hypercall per 4k page so a 256GB domain would require
> 64 million hypercalls. And, since AFAICT, there is no sane
> way to hold the heap_lock across even two hypercalls, speeding
> up the in-hypervisor allocation path, by itself, will not solve
> the TOCTOU race.

No, even in the absence of large pages, the tool stack will do 8M
allocations, just without requesting them to be contiguous. Whether 8M is a
suitable value is another aspect; that value may predate hypercall
preemption, and I don't immediately see why the tool stack shouldn't be able
to request larger chunks (up to the whole amount at once).

Jan
Keir Fraser
2012-Nov-08 08:18 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> It appears that the attempt to use 2MB and 1GB pages is done in
>> the toolstack, and if the hypervisor rejects it, toolstack tries
>> smaller pages. Thus, if physical memory is highly fragmented
>> (few or no order>=9 allocations available), this will result
>> in one hypercall per 4k page so a 256GB domain would require
>> 64 million hypercalls. And, since AFAICT, there is no sane
>> way to hold the heap_lock across even two hypercalls, speeding
>> up the in-hypervisor allocation path, by itself, will not solve
>> the TOCTOU race.
>
> No, even in the absence of large pages, the tool stack will do 8M
> allocations, just without requesting them to be contiguous.
> Whether 8M is a suitable value is another aspect; that value may
> predate hypercall preemption, and I don't immediately see why
> the tool stack shouldn't be able to request larger chunks (up to
> the whole amount at once).

It is probably to allow other dom0 processing (including softirqs) to
preempt the toolstack task, in the case that the kernel was not built with
involuntary preemption enabled (having it disabled is the common case I
believe?). 8M batches may provide enough returns to user space to allow
other work to get a look-in.

> Jan
>
Jan Beulich
2012-Nov-08 08:54 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 08.11.12 at 09:18, Keir Fraser <keir.xen@gmail.com> wrote:
> On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>
>>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>> It appears that the attempt to use 2MB and 1GB pages is done in
>>> the toolstack, and if the hypervisor rejects it, toolstack tries
>>> smaller pages. Thus, if physical memory is highly fragmented
>>> (few or no order>=9 allocations available), this will result
>>> in one hypercall per 4k page so a 256GB domain would require
>>> 64 million hypercalls. And, since AFAICT, there is no sane
>>> way to hold the heap_lock across even two hypercalls, speeding
>>> up the in-hypervisor allocation path, by itself, will not solve
>>> the TOCTOU race.
>>
>> No, even in the absence of large pages, the tool stack will do 8M
>> allocations, just without requesting them to be contiguous.
>> Whether 8M is a suitable value is another aspect; that value may
>> predate hypercall preemption, and I don't immediately see why
>> the tool stack shouldn't be able to request larger chunks (up to
>> the whole amount at once).
>
> It is probably to allow other dom0 processing (including softirqs) to
> preempt the toolstack task, in the case that the kernel was not built with
> involuntary preemption enabled (having it disabled is the common case I
> believe?). 8M batches may provide enough returns to user space to allow
> other work to get a look-in.

That may have mattered when ioctl-s were run with the big kernel
lock held, but even 2.6.18 didn't do that anymore (using the
.unlocked_ioctl field of struct file_operations), which means
that even softirqs will get serviced in Dom0 since the preempted
hypercall gets restarted via exiting to the guest (i.e. events get
delivered). Scheduling is what indeed wouldn't happen, but if
allocation latency can be brought down, 8M might turn out pretty
small a chunk size.

If we do care about Dom0-s running even older kernels (assuming
there ever was a privcmd implementation that didn't use the
unlocked path), or if we have to assume non-Linux Dom0-s might
have issues here, making the tool stack behavior kernel kind/
version dependent without strong need of course wouldn't sound
very attractive.

Jan
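[For readers less familiar with the kernel detail Jan mentions: an ioctl handler registered through the .unlocked_ioctl field of struct file_operations runs without the big kernel lock. The field and its signature are real Linux interfaces; the function and structure names around it below are made up, purely to illustrate the registration.]

    #include <linux/fs.h>
    #include <linux/module.h>

    /* Illustrative privcmd-style handler: hypercall forwarding would live
     * here; preempted hypercalls get restarted by the hypervisor, so pending
     * events (and thus softirqs) are serviced in dom0 between continuations. */
    static long example_privcmd_ioctl(struct file *file, unsigned int cmd,
                                      unsigned long arg)
    {
        return 0;
    }

    static const struct file_operations example_privcmd_fops = {
        .owner          = THIS_MODULE,
        .unlocked_ioctl = example_privcmd_ioctl,  /* BKL-free ioctl path */
    };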
Keir Fraser
2012-Nov-08 09:12 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 08/11/2012 08:54, "Jan Beulich" <JBeulich@suse.com> wrote:

>>>> On 08.11.12 at 09:18, Keir Fraser <keir.xen@gmail.com> wrote:
>> On 08/11/2012 08:00, "Jan Beulich" <JBeulich@suse.com> wrote:
>>
>>>>>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>>>> It appears that the attempt to use 2MB and 1GB pages is done in
>>>> the toolstack, and if the hypervisor rejects it, toolstack tries
>>>> smaller pages. Thus, if physical memory is highly fragmented
>>>> (few or no order>=9 allocations available), this will result
>>>> in one hypercall per 4k page so a 256GB domain would require
>>>> 64 million hypercalls. And, since AFAICT, there is no sane
>>>> way to hold the heap_lock across even two hypercalls, speeding
>>>> up the in-hypervisor allocation path, by itself, will not solve
>>>> the TOCTOU race.
>>>
>>> No, even in the absence of large pages, the tool stack will do 8M
>>> allocations, just without requesting them to be contiguous.
>>> Whether 8M is a suitable value is another aspect; that value may
>>> predate hypercall preemption, and I don't immediately see why
>>> the tool stack shouldn't be able to request larger chunks (up to
>>> the whole amount at once).
>>
>> It is probably to allow other dom0 processing (including softirqs) to
>> preempt the toolstack task, in the case that the kernel was not built with
>> involuntary preemption enabled (having it disabled is the common case I
>> believe?). 8M batches may provide enough returns to user space to allow
>> other work to get a look-in.
>
> That may have mattered when ioctl-s were run with the big kernel
> lock held, but even 2.6.18 didn't do that anymore (using the
> .unlocked_ioctl field of struct file_operations), which means
> that even softirqs will get serviced in Dom0 since the preempted
> hypercall gets restarted via exiting to the guest (i.e. events get
> delivered). Scheduling is what indeed wouldn't happen, but if
> allocation latency can be brought down, 8M might turn out pretty
> small a chunk size.

Ah, then I am out of date on how Linux services softirqs and preemption?
Can softirqs/preemption occur any time, even in kernel mode, so long as
no locks are held?

I thought softirq-type work only happened during event servicing, only
if the event servicing had interrupted user context (ie, would not
happen if started from within kernel mode). So the restart of the
hypercall trap instruction would be an opportunity to service hardirqs,
but not softirqs or scheduler...

 -- Keir

> If we do care about Dom0-s running even older kernels (assuming
> there ever was a privcmd implementation that didn't use the
> unlocked path), or if we have to assume non-Linux Dom0-s might
> have issues here, making the tool stack behavior kernel kind/
> version dependent without strong need of course wouldn't sound
> very attractive.
>
> Jan
Jan Beulich
2012-Nov-08 09:47 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 08.11.12 at 10:12, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 08:54, "Jan Beulich" <JBeulich@suse.com> wrote:
>> That may have mattered when ioctl-s were run with the big kernel
>> lock held, but even 2.6.18 didn't do that anymore (using the
>> .unlocked_ioctl field of struct file_operations), which means
>> that even softirqs will get serviced in Dom0 since the preempted
>> hypercall gets restarted via exiting to the guest (i.e. events get
>> delivered). Scheduling is what indeed wouldn't happen, but if
>> allocation latency can be brought down, 8M might turn out pretty
>> small a chunk size.
>
> Ah, then I am out of date on how Linux services softirqs and preemption? Can
> softirqs/preemption occur any time, even in kernel mode, so long as no locks
> are held?
>
> I thought softirq-type work only happened during event servicing, only if
> the event servicing had interrupted user context (ie, would not happen if
> started from within kernel mode). So the restart of the hypercall trap
> instruction would be an opportunity to service hardirqs, but not softirqs or
> scheduler...

No, irq_exit() can invoke softirqs, provided this isn't a nested IRQ
(soft as well as hard) or softirqs weren't disabled in the interrupted
context. The only thing that indeed is - on non-preemptible kernels -
done only on exit to user mode is the eventual entering of the
scheduler.

Jan
Ian Jackson
2012-Nov-08 10:11 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
Keir Fraser writes ("Re: Proposed new "memory capacity claim" hypercall/feature"):
> On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> > I think this brings us back to the proposed "claim" hypercall/subop.
> > Unless there are further objections or suggestions for different
> > approaches, I'll commence prototyping it, OK?
>
> Yes, in fact I thought you'd started already!

Sorry to play bad cop here but I am still far from convinced that a
new hypercall is necessary or desirable.

A lot of words have been written but the concrete, detailed, technical
argument remains to be made IMO.

Ian.
Keir Fraser
2012-Nov-08 10:50 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:

>> Ah, then I am out of date on how Linux services softirqs and preemption? Can
>> softirqs/preemption occur any time, even in kernel mode, so long as no locks
>> are held?
>>
>> I thought softirq-type work only happened during event servicing, only if
>> the event servicing had interrupted user context (ie, would not happen if
>> started from within kernel mode). So the restart of the hypercall trap
>> instruction would be an opportunity to service hardirqs, but not softirqs or
>> scheduler...
>
> No, irq_exit() can invoke softirqs, provided this isn't a nested IRQ
> (soft as well as hard) or softirqs weren't disabled in the interrupted
> context.

Ah, okay. In fact maybe that's always been the case and I have
misremembered this detail, since the condition for softirq entry in Xen
has always been more strict than this.

> The only thing that indeed is - on non-preemptible kernels - done
> only on exit to user mode is the eventual entering of the scheduler.

That alone may still be an argument for restricting the batch size from
the toolstack?

 -- Keir
Keir Fraser
2012-Nov-08 10:57 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 08/11/2012 10:11, "Ian Jackson" <Ian.Jackson@eu.citrix.com> wrote:

> Keir Fraser writes ("Re: Proposed new "memory capacity claim"
> hypercall/feature"):
>> On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
>>> I think this brings us back to the proposed "claim" hypercall/subop.
>>> Unless there are further objections or suggestions for different
>>> approaches, I'll commence prototyping it, OK?
>>
>> Yes, in fact I thought you'd started already!
>
> Sorry to play bad cop here but I am still far from convinced that a
> new hypercall is necessary or desirable.
>
> A lot of words have been written but the concrete, detailed, technical
> argument remains to be made IMO.

I agree, but prototyping != acceptance, and at least it gives something
concrete to hang the discussion on. Otherwise this longwinded thread is
going nowhere.

> Ian.
Jan Beulich
2012-Nov-08 13:48 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
> On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
>> The only thing that indeed is - on non-preemptible kernels - done
>> only on exit to user mode is the eventual entering of the scheduler.
>
> That alone may still be an argument for restricting the batch size from the
> toolstack?

Yes, this clearly prohibits unlimited batches. But not being able to
schedule should be less restrictive than not being able to run
softirqs, so I'd still put under question whether the limit shouldn't
be bumped.

Jan
Dan Magenheimer
2012-Nov-08 18:38 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 07.11.12 at 23:17, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> > It appears that the attempt to use 2MB and 1GB pages is done in
> > the toolstack, and if the hypervisor rejects it, toolstack tries
> > smaller pages. Thus, if physical memory is highly fragmented
> > (few or no order>=9 allocations available), this will result
> > in one hypercall per 4k page so a 256GB domain would require
> > 64 million hypercalls. And, since AFAICT, there is no sane
> > way to hold the heap_lock across even two hypercalls, speeding
> > up the in-hypervisor allocation path, by itself, will not solve
> > the TOCTOU race.
>
> No, even in the absence of large pages, the tool stack will do 8M
> allocations, just without requesting them to be contiguous.

Rats, you are right (as usual). My debug code was poorly placed and
missed this important point.

So ignore the huge-number-of-hypercalls point and I think we return to:
What is an upper time bound for holding the heap_lock and, for an
arbitrary-sized domain in an arbitrarily-fragmented system, can the
page allocation code be made fast enough to fit within that bound?

I am in agreement that if the page allocation code can be fast enough
so that the heap_lock can be held, this is a better solution than
"claim". I am just skeptical that, in the presence of those two
"arbitraries", it is possible. So I will proceed with more measurements
before prototyping the "claim" stuff.

Dan
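[As a concrete starting point for the measurement Dan mentions, something along these lines inside the hypervisor would give a per-size allocation latency figure. NOW() and alloc_domheap_pages() are existing Xen interfaces; the function itself, where it would be hooked in, and the lack of cleanup (the pages simply stay assigned to the domain) are simplifications for illustration.]

    /* Sketch: time an order-0 allocation loop using Xen's ns-resolution clock. */
    static void measure_alloc_latency(struct domain *d, unsigned long nr_pages)
    {
        s_time_t start = NOW();
        unsigned long i;

        for ( i = 0; i < nr_pages; i++ )
            if ( alloc_domheap_pages(d, 0, 0) == NULL )
                break;

        printk("allocated %lu of %lu order-0 pages in %llu ns\n",
               i, nr_pages, (unsigned long long)(NOW() - start));
    }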
Dan Magenheimer
2012-Nov-08 19:16 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, November 08, 2012 6:49 AM
> To: Keir Fraser
> Cc: Olaf Hering; Ian Campbell; George Dunlap; Ian Jackson; George Shuklin;
>   Dario Faggioli; xen-devel@lists.xen.org; Dan Magenheimer;
>   Konrad Rzeszutek Wilk; Kurt Hackel; Mukesh Rathor; Zhigang Wang; Tim Deegan
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> >>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
> > On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
> >> The only thing that indeed is - on non-preemptible kernels - done
> >> only on exit to user mode is the eventual entering of the scheduler.
> >
> > That alone may still be an argument for restricting the batch size from the
> > toolstack?
>
> Yes, this clearly prohibits unlimited batches. But not being able to
> schedule should be less restrictive than not being able to run
> softirqs, so I'd still put under question whether the limit shouldn't
> be bumped.

Wait, please define unlimited.

I think we are in agreement from previous discussion that, to solve
the TOCTOU race, the heap_lock must be held for the entire allocation
for a domain creation. True?

So unless the limit is "bumped" to handle the largest supported
physical memory size for a domain AND the allocation code in the
hypervisor is rewritten to hold the heap_lock while allocating the
entire extent, bumping the limit doesn't help the TOCTOU race, correct?

Further, holding the heap_lock not only stops scheduling of this pcpu,
but also blocks other domains/pcpus from doing any micro-allocations
at all. True?

Sorry if I am restating the obvious, but I am red-faced about the
huge-number-of-hypercalls mistake, so want to ensure I am
understanding.

Dan

P.S. For PV domains, doesn't the toolstack already use a batch of up
to 2^20 pages? (Or maybe I am misunderstanding/misreading the code in
arch_setup_meminit() in xc_dom_x86.c?)
Dan Magenheimer
2012-Nov-08 21:45 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Subject: Re: Proposed new "memory capacity claim" hypercall/feature
>
> Keir Fraser writes ("Re: Proposed new "memory capacity claim" hypercall/feature"):
> > On 07/11/2012 22:17, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:
> > > I think this brings us back to the proposed "claim" hypercall/subop.
> > > Unless there are further objections or suggestions for different
> > > approaches, I'll commence prototyping it, OK?
> >
> > Yes, in fact I thought you'd started already!
>
> Sorry to play bad cop here but I am still far from convinced that a
> new hypercall is necessary or desirable.
>
> A lot of words have been written but the concrete, detailed, technical
> argument remains to be made IMO.

Hi Ian --

I agree, a _lot_ of words have been written, and this discussion has
had a lot of side conversations, so it has gone back and forth into a
lot of weed patches. I agree it would be worthwhile to restate the
problem clearly, along with some of the proposed solutions/pros/cons.
When I have a chance I will do that, but prototyping may either clarify
some things or bring out some new unforeseen issues, so I think I will
do some more coding first (and this may take a week or two due to some
other constraints).

But to ensure that any summary/restatement touches on your
concerns, could you be more specific as to what you are
unconvinced about?

I.e. I still think the toolstack can manage all memory
allocation; or, holding the heap_lock for a longer period should
solve the problem; or, I don't understand what the original problem
is that you are trying to solve, etc.

Thanks,
Dan
Keir Fraser
2012-Nov-08 22:32 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
On 08/11/2012 19:16, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

>> Yes, this clearly prohibits unlimited batches. But not being able to
>> schedule should be less restrictive than not being able to run
>> softirqs, so I'd still put under question whether the limit shouldn't
>> be bumped.
>
> Wait, please define unlimited.
>
> I think we are in agreement from previous discussion that, to solve
> the TOCTOU race, the heap_lock must be held for the entire allocation
> for a domain creation. True?

It's pretty obvious that this isn't going to be possible in the general
case. E.g., a 40G domain being created out of 4k pages (e.g. because
memory is fragmented) is going to be at least 40G/4k == 10M heap
operations. Say each takes 10ns, which would be quick; we're talking
100ms of cpu work. Holding a lock that long can't be recommended,
really.

 -- Keir
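[Keir's back-of-the-envelope figure, restated as a parameterized calculation so the page size and per-operation cost can be varied; the 10ns per heap operation is his assumed number, not a measurement.]

    #include <stdio.h>

    int main(void)
    {
        unsigned long long domain_bytes = 40ULL << 30;  /* 40G domain */
        unsigned long long page_bytes   = 4096;         /* worst case: all 4k pages */
        unsigned long long ns_per_op    = 10;           /* assumed cost per heap op */
        unsigned long long ops = domain_bytes / page_bytes;  /* ~10M operations */

        printf("%llu heap ops -> ~%llu ms with the heap_lock held\n",
               ops, ops * ns_per_op / 1000000);         /* ~100ms */
        return 0;
    }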
Jan Beulich
2012-Nov-09 08:47 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
>>> On 08.11.12 at 20:16, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> >>> On 08.11.12 at 11:50, Keir Fraser <keir@xen.org> wrote:
>> > On 08/11/2012 09:47, "Jan Beulich" <JBeulich@suse.com> wrote:
>> >> The only thing that indeed is - on non-preemptible kernels - done
>> >> only on exit to user mode is the eventual entering of the scheduler.
>> >
>> > That alone may still be an argument for restricting the batch size from the
>> > toolstack?
>>
>> Yes, this clearly prohibits unlimited batches. But not being able to
>> schedule should be less restrictive than not being able to run
>> softirqs, so I'd still put under question whether the limit shouldn't
>> be bumped.
>
> Wait, please define unlimited.

Unlimited as in unlimited.

> I think we are in agreement from previous discussion that, to solve
> the TOCTOU race, the heap_lock must be held for the entire allocation
> for a domain creation. True?

That's only one way (and as Keir already responded, not one that we
should actually pursue). The point about being fast enough was rather
made to allow a decision towards the feasibility of intermediately
disabling tmem (or at least allocations originating from it) in
particular (I'm not worried about micro-allocations - the tool stack
has to provide some slack in its calculations for this anyway).

Jan
Ian Jackson
2012-Nov-12 11:03 UTC
Re: Proposed new "memory capacity claim" hypercall/feature
Dan Magenheimer writes ("RE: Proposed new "memory capacity claim" hypercall/feature"):
> But to ensure that any summary/restatement touches on your
> concerns, could you be more specific as to what you are
> unconvinced about?
>
> I.e. I still think the toolstack can manage all memory
> allocation;

I'm still unconvinced that this is false. I think it's probably true.

Ian.