Dan Magenheimer
2009-Jan-08 17:27 UTC
[Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
At last year's Xen North America Summit in Boston, I gave a talk about memory overcommitment in Xen. I showed that the basic mechanisms for moving memory between domains were already present in Xen and that, with a few scripts, it was possible to roughly load-balance memory between domains. During this effort, I discovered that "ballooning" has a lot of weaknesses, even though it is the foundation for "time-sharing" physical memory in every major virtualization system existing today. These weaknesses have led in many cases to unacceptable performance issues when VMs are densely packed; as a result, memory is becoming the bottleneck in many deployments of virtualization.

Transcendent Memory -- or "tmem" for short -- is phase II of that work, and it essentially augments ballooning and "fixes" many of its weaknesses. It requires paravirtualization, but the changes (to Linux) are fairly small and minimally invasive. The changes to Xen are larger, but also fairly non-invasive. (No shell scripts this time! :-) The concept and code is modular and may easily port to Windows, as well as KVM. It may even be useful in containers and in a native physical operating system. And, yes, it is machine-independent so should be easily portable to ia64 too!

Basically, instead of moving the ownership of all physical memory between one domain and another, tmem instead collects system-wide underutilized memory into a "pool" in the hypervisor and provides indirect access to that memory so that it can serve the needs of domains without necessarily being directly addressable by the domains it serves. It is implemented with a small set of (hyper)calls that enable pages to be copied between a domain and Xen, controlled by a carefully-crafted set of semantics that make it easy in most cases for memory to be reclaimed by Xen as memory needs vary (as they often do -- rapidly and unpredictably). As a result, physical memory is utilized more efficiently, reducing unnecessary paging and the likelihood of thrashing, and thus increasing performance and/or allowing greater VM density.

If you are interested in this topic, please see:

http://oss.oracle.com/projects/tmem
(note, site is sometimes slow)

for more information. This site will be updated frequently, with patches, documentation, and FAQs. The site also supports mailing lists, though I'd prefer to have all Xen-related discussions start on xen-devel.

Linux patches based on 2.6.18-xen, 2.6.27-xen, and 2.6.28 are available. The Xen patch is currently based on 3.3.0+ and I am in the process of updating it and cleaning it up, so will post it in the near future, but can provide it to anyone who is very interested in seeing/trying it now. I could use some help on the "control plane" python software, in performance evaluation, and in "porting".

Comments and questions welcome. I also plan to submit an abstract for the upcoming Xen summit and, if accepted, give a talk about tmem there.

Thanks,
Dan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
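To make the put/get/reclaim semantics described above concrete, here is a minimal user-space sketch of an ephemeral ("precache"-style) pool. It is illustrative only: the function names, slot structure, and the choice that a successful get also evicts the page are assumptions for this sketch, not the actual tmem ABI. The key property it models is that a put is a hint, not a guarantee -- the hypervisor may discard the page at any time, and a failed get simply sends the caller back to disk.

```c
/* Illustrative model of ephemeral tmem semantics; not the real Xen API. */
#include <string.h>
#include <stdint.h>

#define PAGE_SIZE 4096
#define POOL_SLOTS 8

struct tmem_slot {
    int      in_use;
    uint64_t object;            /* e.g. an inode number */
    uint32_t index;             /* page index within the object */
    char     data[PAGE_SIZE];
};

static struct tmem_slot pool[POOL_SLOTS];

/* Copy a clean page into the pool; may fail if the pool is full.
 * The caller must treat success as a hint only. */
int tmem_put_page(uint64_t object, uint32_t index, const char *page)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (!pool[i].in_use) {
            pool[i].in_use = 1;
            pool[i].object = object;
            pool[i].index = index;
            memcpy(pool[i].data, page, PAGE_SIZE);
            return 0;
        }
    }
    return -1;
}

/* Copy a page back out if the pool still holds it.  In this sketch a
 * successful get also evicts the page (ephemeral semantics). */
int tmem_get_page(uint64_t object, uint32_t index, char *page)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (pool[i].in_use && pool[i].object == object &&
            pool[i].index == index) {
            memcpy(page, pool[i].data, PAGE_SIZE);
            pool[i].in_use = 0;
            return 0;
        }
    }
    return -1;  /* page was reclaimed; caller falls back to disk */
}

/* The hypervisor may reclaim any or all slots as memory needs vary. */
void tmem_reclaim_all(void)
{
    for (int i = 0; i < POOL_SLOTS; i++)
        pool[i].in_use = 0;
}
```

The essential contract is visible in the return codes: the kernel never depends on a page surviving in the pool, which is what lets the hypervisor reclaim memory cheaply and unilaterally.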
Alan Cox
2009-Jan-08 21:38 UTC
Re: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
> Comments and questions welcome. I also plan to submit an
> abstract for the upcoming Xen summit and, if accepted, give
> a talk about tmem there.

I assume you've looked at how S/390 handles this problem - the guests can mark pages as stable or unused and the rest is up to the hypervisor? No complex pool interfaces, and the resulting interface slots into the Linux kernel as a pair of one-liners in existing arch_foo hooks in the mm. The S/390 keeps the question of shared/private memory objects separate from the question of whether they are currently used - a point on which I think their model and interface is probably better.

I would look at the patches, but the URL you give contains nothing but an empty repository. I'd be interested to see how the kernel patches look, and also how you implement migration of some of the users of a shared pool object - do you implement a full clustered memory manager, and what do the performance figures look like across networks? How do you find a pool across the network?

It's interesting, as you can do a lot of other interesting things with this kind of interface. Larry McVoy's bitcluster SMP proposal was built on a similar idea, using refcounted page loans to drive a multiprocessor NUMA box as a cluster with page sharing.

Alan
Dan Magenheimer
2009-Jan-08 22:32 UTC
RE: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
Hi Alan --

Sorry, I'm not trying to be cryptic. I'm having some network problems and will try to post the Linux patch shortly.

I'm aware of IBM's CMM and I think (hope) tmem achieves similar goals but is much, much simpler and more extensible for other interesting uses. After I get the patch posted, let me know if you agree.

Yes, one of the other uses for tmem is for cluster nodes co-resident on a physical machine to share a "virtual page cache". That's under development... working with the ocfs2 team.

Looking forward to more discussion...
Dan
Dan Magenheimer
2009-Jan-09 00:00 UTC
RE: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
For expediency, I've posted the Linux 2.6.28 patchset for tmem, implementing precache and preswap, here:

http://oss.oracle.com/pipermail/tmem-devel/2009-January/000002.html

This being the first time I've used that mailing-list-er myself, I had thought that the attached patches would be inlined, but I see they were not, and it will be necessary to click on a URL for each. Apologies... if this is inconvenient, please let me know and I will repost (to xen-devel?).

It also appears that replying to messages on tmem-devel requires list membership. Apologies again... please respond to xen-devel with comments for now.

Thanks,
Dan
Dan Magenheimer
2009-Jan-09 15:56 UTC
RE: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
Hi Alan --

Sorry for the terse reply yesterday. I was told that my reply was unclear, and I also see I neglected to reply to some of your questions, so please let me try again.

> I assume you've looked at how S/390 handles this problem

Could you point me to the two one-liners? If they are simply informing the hypervisor, that is certainly a step in the right direction. IBM's CMM2 is of course much more complex (and I'm told was abandoned for that reason). Tmem falls in between and is probably a good compromise; the change is small (though certainly greater than two one-liners) but provides lots of useful information to the hypervisor. One additional advantage of tmem is that the Linux kernel actively participates in the "admission policy", so this information need not be inferred outside of the kernel by the hypervisor.

> I would look at the patches but the URL you give contains
> nothing but an empty repository.
> I'd be interested to see how the kernel patches look

Sorry, the patch should be there now. The site is still under construction :-} Feedback very much appreciated.

(NOTE: The following refers to advanced features of tmem still under development, not part of the core features already submitted.)

> and also how you implement migration of some of the users of a shared
> pool object - do you implement a full clustered memory manager

Shared ephemeral pools (e.g. precache) don't need to migrate. Since the tmem client (e.g. ocfs2 in the Linux kernel) is responsible for ensuring consistency, there should be no dirty pages that need to be migrated, and clean pages can be left behind (if there are other sharing "nodes") or automatically flushed (when the last sharing node migrates or is shut down).

> and what do the performance figures look like across networks?
> How do you find a pool across the network?

I'm not trying to implement distributed shared memory. There is no "across the network" except what the cluster fs handles already. The clusterfs must ensure that the precache (shared between co-resident nodes) is consistent, probably by flushing any inodes/objects for which another node (physically co-resident or not) requests an exclusive lock.

> It's interesting as you can do a lot of other interesting
> things with this kind of interface.

Indeed I hope so. I'd like to discuss with you (offline?) whether this interface might have some value for a future SSD interface or might help for hot-swap memory. I also think it might be used like compcache, but with the advantage that clean page cache pages can also be compressed.

> Larry McVoy's bitcluster SMP proposal was

I'll try to look that up.

Thanks again for your reply!
Dan
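The flush-on-exclusive-lock rule Dan describes for keeping a shared precache consistent can be sketched in a few lines. Everything here is invented for illustration (it is not ocfs2 or tmem code): before a node grants another node an exclusive lock on an object, it flushes that object's pages from the shared pool so no stale clean copies survive.

```c
/* Illustrative sketch of flush-before-exclusive-lock; all names invented. */
#include <stdint.h>

#define MAX_OBJS 16

static uint64_t cached_objects[MAX_OBJS];  /* objects with pages in precache */
static int      cached_count;

/* Stand-in for the tmem "flush object" operation: drop all of this
 * object's clean pages from the shared pool. */
static void precache_flush_object(uint64_t object)
{
    for (int i = 0; i < cached_count; i++) {
        if (cached_objects[i] == object) {
            cached_objects[i] = cached_objects[--cached_count];
            return;
        }
    }
}

int precache_contains(uint64_t object)
{
    for (int i = 0; i < cached_count; i++)
        if (cached_objects[i] == object)
            return 1;
    return 0;
}

void precache_note_object(uint64_t object)
{
    if (cached_count < MAX_OBJS && !precache_contains(object))
        cached_objects[cached_count++] = object;
}

/* Called when another node requests an exclusive lock on an object:
 * flush our cached pages for it first, then grant the lock. */
int grant_exclusive_lock(uint64_t object)
{
    precache_flush_object(object);
    return 0;   /* lock granted */
}
```

Because only clean pages live in the shared pool, a flush never loses data; it merely forces the next reader to go back to the filesystem, which is exactly the behavior a cluster lock manager needs.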
Alan Cox
2009-Jan-09 18:37 UTC
Re: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
> Could you point me to the two one liners? If they are
> simply informing the hypervisor, that is certainly a step
> in the right direction. IBM's CMM2 is of course much more

Yes - they hook arch_free_page and arch_alloc_page so that free pages are known to the hypervisor layer.

> the Linux kernel actively participates in the "admission
> policy" so this information need not be inferred outside
> of the kernel by the hypervisor.

Yes - the patches are very interesting, and you take it a stage further than the S/390 hooks by exposing a lot more to the hypervisor.

> I'm not trying to implement distributed shared memory.
> There is no "across the network" except what the cluster
> fs handles already. The clusterfs must ensure that the

That was what confused me about the shared pools. I had assumed that shared pools would imply DSM, simply because two guests could use a shared pool and one of them get live-migrated without the other.

> SSD interface or might help for hot-swap memory.

Not something I'd thought about. The problem with hot swap is generally one of managing to get stuff removed from a given physical page of RAM. Having more flexible allocators probably helps there, simply because you can make space to relocate pages underneath the guest.

> I also think it might be used like compcache, but with
> the advantage that clean page cache pages can also be
> compressed.

Would it be useful to distinguish between pages the OS definitely doesn't care about (freed) and pages that can vanish, at least in terms of reclaiming them between guests? It seems that truly free pages are the first target, and there may even be a proper hierarchy.
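The "pair of one-liners" Alan refers to can be sketched as follows. The hook names arch_free_page and arch_alloc_page are the real Linux mm hooks he names; the hypervisor-notification bodies are illustrative stubs (here just counters), since the real S/390 calls set per-page state in the machine.

```c
/* Sketch of S/390-style page-state hinting via the existing mm hooks.
 * The hv_* functions are stand-ins for real hypervisor calls. */
#include <stddef.h>

static unsigned long pages_marked_unused;
static unsigned long pages_marked_stable;

/* Tell the hypervisor this page's contents no longer matter. */
static void hv_mark_page_unused(void *page) { (void)page; pages_marked_unused++; }

/* Tell the hypervisor this page is in use and must be preserved. */
static void hv_mark_page_stable(void *page) { (void)page; pages_marked_stable++; }

/* The two one-liners: each existing arch hook just forwards the event. */
void arch_free_page(void *page, int order)  { (void)order; hv_mark_page_unused(page); }
void arch_alloc_page(void *page, int order) { (void)order; hv_mark_page_stable(page); }
```

The contrast with tmem is visible even in this toy: the hooks only report page state, whereas tmem's calls move page contents and let the kernel participate in admission decisions.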
Dan Magenheimer
2009-Jan-09 20:29 UTC
RE: [Xen-devel] [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
This discussion probably belongs on lkml rather than xen-devel, but since we've started here already...

> > SSD interface or might help for hot-swap memory.
>
> Not something I'd thought about. The problem with hot swap is generally
> one of managing to get stuff removed from a given physical page of RAM.
> Having more flexible allocators probably helps there simply because you
> can make space to relocate pages underneath the guest.

Hot-swap: What I have in mind is as follows (and I'm talking about a native kernel, no Xen here): Hot-delete requires some kind of kernel notification (provoked by operator or hardware) that says "this physical address range is going to disappear soon", at which point the kernel will try to abandon that piece of memory. Between the time of the notification and the actual disappearance, which may be a fairly long time!, that memory goes unused. During that time period, that memory could be configured and used as a precache pool, clean pages only, so when the actual removal event happens, no valuable data is lost, but in the meantime the memory isn't wasted.

SSD: Pardon my ignorance, but will SSD ever be fast enough to be used as slow RAM (e.g. synchronously accessed, but still classified as second-class RAM)? If so, hiding it from guests and only allowing it to be used via the tmem interface might be a nice way to get the benefits of SSD without the major kernel changes required to deal with two classes of RAM (normal and slow).

> > I also think it might be used like compcache, but with
> > the advantage that clean page cache pages can also be
> > compressed.
>
> Would it be useful to distinguish between pages the OS definitely doesn't
> care about (freed) and pages that can vanish, at least in terms of
> reclaiming them between guests? It seems that truly free pages are the
> first target and there may even be a proper hierarchy.

I think this would be useful for periods of time where a guest is "down-revving" from very busy to idle, because it would more proactively notify the hypervisor that a lot of memory is available, without waiting for whatever automated ballooning to notice. However, if the memory is only temporarily free (say, between compiles in a make), the information might be misleading.

It would be interesting to study the distribution of lengths of time between when:

1) a page is last written
2) the page is "repurposed" (or freed?) in the kernel
3) the page is written again

In the time between (2) and (3), the page is "idle", and if the average interval is long enough, certainly that page-worth of memory could be reclaimed by the hypervisor for another guest. (Does KVM do this already instead of ballooning?)

> It seems that truly free pages are the
> first target and there may even be a proper hierarchy.

I definitely agree that there is a proper hierarchy, and that better taxonomizing within the kernel will pay off sooner or later, at least in virtualized environments.

Thanks!
Dan
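The (1)/(2)/(3) interval study Dan proposes could be instrumented with a small per-page record like the following. This is a hypothetical sketch, not kernel code: timestamps are caller-supplied abstract ticks, and the structure and function names are invented.

```c
/* Toy per-page instrumentation for measuring the "idle interval":
 * (1) last write, (2) freed/repurposed, (3) next write. */

struct page_times {
    unsigned long last_write;   /* (1) time of last write before free */
    unsigned long freed;        /* (2) time the page was freed/repurposed */
    unsigned long next_write;   /* (3) time of first write after free */
};

void note_write(struct page_times *p, unsigned long now)
{
    if (p->freed && !p->next_write)
        p->next_write = now;    /* first write after being freed */
    else
        p->last_write = now;
}

void note_freed(struct page_times *p, unsigned long now)
{
    p->freed = now;
    p->next_write = 0;
}

/* Idle interval (2)->(3): how long this page's memory could have been
 * lent to another guest.  Returns 0 if the page was never rewritten. */
unsigned long idle_interval(const struct page_times *p)
{
    if (p->freed && p->next_write && p->next_write > p->freed)
        return p->next_write - p->freed;
    return 0;
}
```

Collecting a histogram of these intervals across a workload would answer the question posed above: whether freed pages typically stay idle long enough to be worth handing to the hypervisor.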
Dan Magenheimer
2009-Jan-16 18:20 UTC
[Xen-devel] RE: [RFC] Transcendent Memory ("tmem"): a new approach to physical memory management
> The Xen patch is currently based on 3.3.0+
> and I am in the process of updating it and cleaning it up, so
> will post it in the near future, but can provide it to anyone
> who is very interested in seeing/trying it now. I could
> use some help on the "control plane" python software,
> in performance evaluation, and in "porting".

For those interested in tmem, I have ported the Xen 3.3 patch to xen-unstable (cset 19043), and the monolithic patch (plus the Linux patch) can be viewed at:

http://oss.oracle.com/projects/tmem/files/

After I've completed the control plane, I'll post the patch more formally and less monolithically. Note that I am still trying to track down an ASSERT bug that wasn't present before the port, but I don't think anyone is going to apply this to a production system anyway :-) I also haven't tested it recently on a 32-bit hypervisor, but it's a bit of a toy on 32-bit anyway because of the 12MB limit on the xenheap.

More patch and usage documentation will be forthcoming, but a quick run-through of the patch is below for those that don't want to dig through 2500 lines of code. Comments and questions are very welcome!

Thanks,
Dan

P.S. If you reply-all to this message, ignore bounces from tmem-devel; I am the moderator and will approve your post. If you are interested in other (e.g. non-Xen) discussion of tmem, please feel free to subscribe to tmem-devel via: http://oss.oracle.com/projects/tmem/mailman/

Direct link to Xen patch:
http://oss.oracle.com/projects/tmem/dist/files/xen-unstable/tmem-xen-unstable-19043-090115.patch

Core functionality:
===================
common/tmem.c: implementation of transcendent memory
common/radix-tree.c: heavily leveraged from Linux (see comment near beginning for differences)
include/xen/tmem.h: defines and declarations for tmem
include/xen/radix-tree.h: heavily leveraged from Linux
common/Makefile: add tmem.o and radix-tree.o

New hypercall: (only one new hypercall!)
========================================
include/public/xen.h: new tmem hypercall and interface
include/xen/hypercall.h: ditto
various/entry.S: ditto

Misc interface stuff:
=====================
arch/x86/setup.c: call init_tmem()
common/domain.c: destroy a domain's pool when it dies
common/page_alloc.c: comment out an annoying printk
common/xmalloc_tlsf.c: use domheap instead of xenheap for xmalloc/xfree and add some useful measurements (metrics will be removed in final patch)
include/xen/hash.h: identical to Linux version
include/xen/sched.h: add per-domain tmem container pointer
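The "only one new hypercall" design in the patch run-through suggests a single entry point multiplexed over an opcode. The sketch below shows that dispatch shape; the struct layout, opcode names, and return values are invented for illustration and do not reflect the real ABI in the posted patch.

```c
/* Hypothetical sketch of a single multiplexed tmem hypercall. */
#include <stdint.h>

enum tmem_opcode {
    TMEM_NEW_POOL,
    TMEM_PUT_PAGE,
    TMEM_GET_PAGE,
    TMEM_FLUSH_PAGE,
    TMEM_DESTROY_POOL
};

struct tmem_op {
    uint32_t op;        /* one of enum tmem_opcode */
    uint32_t pool_id;
    uint64_t object;    /* e.g. inode-like object id */
    uint32_t index;     /* page index within the object */
    uint64_t gmfn;      /* guest frame to copy to/from */
};

/* Hypervisor-side dispatch: one entry point, a switch on the opcode.
 * The bodies are stubs standing in for the real pool operations. */
long do_tmem_op(struct tmem_op *op)
{
    switch (op->op) {
    case TMEM_NEW_POOL:     return 1;   /* would return a new pool id */
    case TMEM_PUT_PAGE:     return 0;   /* would copy page into pool */
    case TMEM_GET_PAGE:     return -1;  /* page not present: go to disk */
    case TMEM_FLUSH_PAGE:   return 0;
    case TMEM_DESTROY_POOL: return 0;
    default:                return -38; /* ENOSYS-style failure */
    }
}
```

Multiplexing keeps the guest-visible ABI to a single hypercall number, so adding new tmem operations later only extends the opcode enum rather than the hypercall table.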