Dan Magenheimer
2009-Jul-07 16:17 UTC
[Xen-devel] [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Tmem [PATCH 0/4] (Take 2): Transcendent memory

Transcendent memory - Take 2

Changes since take 1:
1) Patches can be applied serially; function names in diff (Rik van Riel)
2) Descriptions and diffstats for individual patches (Rik van Riel)
3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
4) Drop shared pools until security implications are understood (Pavel
   Machek and Jeremy Fitzhardinge)
5) Documentation/transcendent-memory.txt added, including an API
   description (see also below).

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

Normal memory is directly addressable by the kernel, of a known and
normally fixed size, synchronously accessible, and persistent (though
not across a reboot).

What if there were a class of memory that is of unknown and dynamically
variable size, is addressable only indirectly by the kernel, can be
configured either as persistent or as "ephemeral" (meaning it will be
around for a while but might disappear without warning), and is still
fast enough to be synchronously accessible?

We call this latter class "transcendent memory", and it provides an
interesting opportunity to utilize RAM more efficiently in a virtualized
environment. However, this "memory but not really memory" may also have
applications in NON-virtualized environments, such as hotplug-memory
deletion, SSDs, and page cache compression. Others have suggested ideas
such as allowing use of highmem memory without a highmem kernel, or use
of spare video memory.

Transcendent memory, or "tmem" for short, provides a well-defined API to
access this unusual class of memory. (A summary of the API is provided
below.) The basic operations are page-copy-based and use a flexible
object-oriented addressing mechanism. Tmem assumes that some "privileged
entity" is capable of executing tmem requests and storing pages of data;
this entity is currently a hypervisor, and operations are performed via
hypercalls, but the entity could be a kernel policy, or perhaps a
"memory node" in a cluster of blades connected by a high-speed
interconnect such as HyperTransport or QPI.

Since tmem is not directly accessible, and because page copying is done
to/from physical pageframes, it is more suitable for in-kernel memory
needs than for userland applications. However, there may be yet
undiscovered userland possibilities.

With the tmem concept outlined vaguely and its broader potential hinted
at, we will overview two existing examples of how tmem can be used by
the kernel.

"Precache" can be thought of as a page-granularity victim cache for
clean pages that the kernel's pageframe replacement algorithm (PFRA)
would like to keep around, but can't because there isn't enough memory.
So when the PFRA "evicts" a page, it first puts it into the precache via
a call to tmem. And any time a filesystem reads a page from disk, it
first attempts to get the page from precache. If it's there, a disk
access is eliminated. If not, the filesystem just goes to the disk as
usual. Precache is "ephemeral", so whether a page is kept in precache
(between the "put" and the "get") depends on a number of factors that
are invisible to the kernel.

"Preswap" IS persistent, but for various reasons may not always be
available for use, again due to factors that may not be visible to the
kernel (briefly: if the kernel is being "good" and has shared its
resources nicely, then it will be able to use preswap, else it will
not). Once a page is put, a get on the page will always succeed.
So when the kernel finds itself in a situation where it needs to swap
out a page, it first attempts to use preswap. If the put works, a disk
write and (usually) a later disk read are avoided. If it doesn't, the
page is written to swap as usual. Unlike precache, whether a page is
stored in preswap vs swap is recorded in kernel data structures, so when
a page needs to be fetched, the kernel does a get if it is in preswap
and reads from swap if it is not.

Both precache and preswap may optionally be compressed, trading off
roughly 2x space reduction against roughly 10x slower access. Precache
also has a sharing feature, which allows different nodes in a "virtual
cluster" to share a local page cache.

Tmem has some similarity to IBM's Collaborative Memory Management, but
creates more of a partnership between the kernel and the "privileged
entity" and is not very invasive. Tmem may be applicable for KVM and
containers; there is some disagreement on the extent of its value. Tmem
is highly complementary to ballooning (aka page-granularity hot plug)
and memory deduplication (aka transparent content-based page sharing),
but still has value when neither is present.

Performance is difficult to quantify because some benchmarks respond
very favorably to increases in memory, and tmem may do quite well on
those, depending on how much tmem is available, which may vary widely
and dynamically depending on conditions completely outside of the
system being measured. Ideas on how best to provide useful metrics
would be appreciated.

Tmem is now supported in Xen's unstable tree (targeted for the Xen 3.5
release) and in Xen's Linux 2.6.18-xen source tree. Again, Xen is not
necessarily a requirement, but currently provides the only existing
implementation of tmem.

Lots more information about tmem can be found at:
http://oss.oracle.com/projects/tmem
and there will be a talk about it on the first day of the Linux
Symposium in July 2009.

Tmem is the result of a group effort, including Dan Magenheimer, Chris
Mason, Dave McCracken, Kurt Hackel and Zhigang Wang, with helpful input
from Jeremy Fitzhardinge, Keir Fraser, Ian Pratt, Sunil Mushran, Joel
Becker, and Jan Beulich.

THE TRANSCENDENT MEMORY API

Transcendent memory is made up of a set of pools. Each pool is made up
of a set of objects. And each object contains a set of pages. The
combination of a 32-bit pool id, a 64-bit object id, and a 32-bit page
id uniquely identifies a page of tmem data, and this tuple is called a
"handle". Commonly, the three parts of a handle are used to address a
filesystem, a file within that filesystem, and a page within that file;
however, an OS can use any values as long as they uniquely identify a
page of data.

When a tmem pool is created, it is given certain attributes: it can be
private or shared, and it can be persistent or ephemeral. Each
combination of these attributes provides a different set of useful
functionality and also defines a slightly different set of semantics for
the various operations on the pool. Other pool attributes include the
size of the page and a version number.

Once a pool is created, operations are performed on the pool. Pages are
copied between the OS and tmem and are addressed using a handle. Pages
and/or objects may also be flushed from the pool. When all operations
are completed, a pool can be destroyed.
The specific tmem functions are called in Linux through a set of
accessor functions:

int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
int (*destroy_pool)(u32 pool_id);
int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
int (*flush_page)(u32 pool_id, u64 object, u32 index);
int (*flush_object)(u32 pool_id, u64 object);

The new_pool accessor creates a new pool and returns a pool id, which is
a non-negative 32-bit integer. If the flags parameter specifies that the
pool is to be shared, the uuid is a 128-bit "shared secret"; otherwise
it is ignored. The destroy_pool accessor destroys the pool. (Note:
shared pools are not supported until security implications are better
understood.)

The put_page accessor copies a page of data from the specified pageframe
and associates it with the specified handle. The get_page accessor looks
up a page of data in tmem associated with the specified handle and, if
found, copies it to the specified pageframe. The flush_page accessor
ensures that subsequent gets of a page with the specified handle will
fail. The flush_object accessor ensures that subsequent gets of any page
matching the pool id and object will fail.

There are many subtle but critical behaviors for get_page and put_page:

- Any put_page (with one notable exception) may be rejected and the
  client must be prepared to deal with that failure. A put_page copies,
  NOT moves, data; that is, the data exists in both places. Linux is
  responsible for destroying or overwriting its own copy, or alternately
  managing any coherency between the copies.
- Every page successfully put to a persistent pool must be found by a
  subsequent get_page that specifies the same handle. A page
  successfully put to an ephemeral pool has an indeterminate lifetime,
  and even an immediately subsequent get_page may fail.
- A get_page to a private pool is destructive, that is, it behaves as if
  the get_page were atomically followed by a flush_page. A get_page to a
  shared pool is non-destructive. A flush_page behaves just like a
  get_page to a private pool except the data is thrown away.
- Put-put-get coherency is guaranteed. For example, after the sequence:
    put_page(ABC,D1); put_page(ABC,D2); get_page(ABC,E)
  E may never contain the data from D1. However, even for a persistent
  pool, the get_page may fail if the second put_page indicates failure.
- Get-get coherency is guaranteed. For example, in the sequence:
    put_page(ABC,D); get_page(ABC,E1); get_page(ABC,E2)
  if the first get_page fails, the second must also fail.
- A tmem implementation provides no serialization guarantees (e.g. to an
  SMP Linux). So if different Linux threads are putting and flushing the
  same page, the results are indeterminate; coherency in such cases is
  not guaranteed and must be synchronized by Linux.
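For concreteness, here is a minimal sketch of how a client such as
precache might drive these accessors. The struct tag tmem_ops follows
the discussion above; the uuid layout, the pool-id bookkeeping, and the
helper names below are assumptions for illustration rather than the
patch's exact code, and 0 is assumed to mean success.

/*
 * Illustrative sketch only: how a precache-style client might use the
 * accessor functions described above.  Assumed names and conventions,
 * not the exact code in the patch series.
 */
#include <linux/types.h>

struct tmem_pool_uuid { u64 lo; u64 hi; };	/* assumed layout */

struct tmem_ops {
	int (*new_pool)(struct tmem_pool_uuid uuid, u32 flags);
	int (*destroy_pool)(u32 pool_id);
	int (*put_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
	int (*get_page)(u32 pool_id, u64 object, u32 index, unsigned long pfn);
	int (*flush_page)(u32 pool_id, u64 object, u32 index);
	int (*flush_object)(u32 pool_id, u64 object);
};

static struct tmem_ops *tmem_ops;	/* set by the backend, e.g. the Xen driver */
static int precache_poolid = -1;	/* from new_pool(); -1 means "no pool" */

/* Offer an about-to-be-evicted clean page to the ephemeral pool.
 * Failure is harmless: the page can always be re-read from disk. */
static void precache_put(u64 file_oid, u32 index, unsigned long pfn)
{
	if (precache_poolid >= 0)
		tmem_ops->put_page(precache_poolid, file_oid, index, pfn);
}

/* Try to satisfy a read from the ephemeral pool before going to disk.
 * Returns 1 and fills the pageframe on a hit, 0 on a miss. */
static int precache_get(u64 file_oid, u32 index, unsigned long pfn)
{
	if (precache_poolid < 0)
		return 0;
	return tmem_ops->get_page(precache_poolid, file_oid, index, pfn) == 0;
}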
Changed core kernel files:
 fs/buffer.c            |    5 +
 fs/ext3/super.c        |    2
 fs/mpage.c             |    8 ++
 fs/super.c             |    5 +
 include/linux/fs.h     |    7 ++
 include/linux/swap.h   |   57 +++++++++++++++++++++
 include/linux/sysctl.h |    1
 kernel/sysctl.c        |   12 ++++
 mm/Kconfig             |   26 +++++++++
 mm/Makefile            |    3 +
 mm/filemap.c           |   11 ++++
 mm/page_io.c           |   12 ++++
 mm/swapfile.c          |   46 ++++++++++++++--
 mm/truncate.c          |   10 +++
 14 files changed, 199 insertions(+), 6 deletions(-)

Newly added core kernel files:
 Documentation/transcendent-memory.txt |  175 +++++++++++++
 include/linux/tmem.h                  |   88 ++++++
 mm/precache.c                         |  134 ++++++++++
 mm/preswap.c                          |  273 +++++++++++++++++++++
 4 files changed, 670 insertions(+)

Changed xen-specific files:
 arch/x86/include/asm/xen/hypercall.h |    8 +++
 drivers/xen/Makefile                 |    1
 include/xen/interface/tmem.h         |   43 +++++++++++++++++++++
 include/xen/interface/xen.h          |   22 ++++++++++
 4 files changed, 74 insertions(+)

Newly added xen-specific files:
 drivers/xen/tmem.c           |   97 +++++++++++++++++++++
 include/xen/interface/tmem.h |   43 +++++++++
 2 files changed, 140 insertions(+)
Rik van Riel
2009-Jul-07 17:28 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> "Preswap" IS persistent, but for various reasons may not always be
> available for use, again due to factors that may not be visible to the
> kernel (but, briefly, if the kernel is being "good" and has shared its
> resources nicely, then it will be able to use preswap, else it will not).
> Once a page is put, a get on the page will always succeed.

What happens when all of the free memory on a system has been consumed
by preswap by a few guests? Will the system be unable to start another
guest, or is there some way to free the preswap memory?
Dan Magenheimer
2009-Jul-07 19:53 UTC
[Xen-devel] RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
> From: Rik van Riel [mailto:riel@redhat.com]
> Dan Magenheimer wrote:
> > "Preswap" IS persistent, but for various reasons may not always be
> > available for use, again due to factors that may not be visible to the
> > kernel (but, briefly, if the kernel is being "good" and has shared its
> > resources nicely, then it will be able to use preswap, else it will not).
> > Once a page is put, a get on the page will always succeed.
>
> What happens when all of the free memory on a system
> has been consumed by preswap by a few guests?
> Will the system be unable to start another guest,

The default policy (and only policy implemented as of now) is that no
guest is allowed to use more than max_mem for the sum of
directly-addressable memory (e.g. RAM) and persistent tmem (e.g.
preswap). So if a guest is using its default memory==max_mem and is
doing no ballooning, nothing can be put in preswap by that guest.

> or is there some way to free the preswap memory?

Yes and no. There is no way externally to free preswap memory, but an
in-guest userland root service can write to sysfs to affect preswap
size. This essentially does a partial swapoff on preswap if there is
sufficient (directly addressable) guest RAM available. (I have this
prototyped as part of the xenballoond self-ballooning service in
xen-unstable.)

Dan
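The default policy Dan describes amounts to a simple admission check on
persistent puts. A minimal sketch under stated assumptions -- the
structure and field names below are hypothetical, not Xen's actual
per-domain accounting fields:

/*
 * Hedged sketch of the described policy: a put into a persistent pool
 * (preswap) is admitted only while RAM pages plus persistent tmem pages
 * stay within the guest's max_mem.  Names are illustrative.
 */
struct domain_mem_account {
	unsigned long tot_pages;	/* directly addressable RAM pages */
	unsigned long preswap_pages;	/* pages held in persistent tmem  */
	unsigned long max_pages;	/* configured max_mem, in pages   */
};

static int preswap_put_allowed(const struct domain_mem_account *d)
{
	return d->tot_pages + d->preswap_pages < d->max_pages;
}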
Anthony Liguori
2009-Jul-08 22:56 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> Tmem [PATCH 0/4] (Take 2): Transcendent memory
> Transcendent memory - Take 2
> Changes since take 1:
> 1) Patches can be applied serially; function names in diff (Rik van Riel)
> 2) Descriptions and diffstats for individual patches (Rik van Riel)
> 3) Restructure of tmem_ops to be more Linux-like (Jeremy Fitzhardinge)
> 4) Drop shared pools until security implications are understood (Pavel
>    Machek and Jeremy Fitzhardinge)
> 5) Documentation/transcendent-memory.txt added including API description
>    (see also below for API description).
>
> Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>
>
> Normal memory is directly addressable by the kernel, of a known
> normally-fixed size, synchronously accessible, and persistent (though
> not across a reboot).
>
> What if there was a class of memory that is of unknown and dynamically
> variable size, is addressable only indirectly by the kernel, can be
> configured either as persistent or as "ephemeral" (meaning it will be
> around for awhile, but might disappear without warning), and is still
> fast enough to be synchronously accessible?

I have trouble mapping this to a VMM capable of overcommit without just
coming back to CMM2.

In CMM2 parlance, ephemeral tmem pools are just normal kernel memory
marked in the volatile state, no?

It seems to me that an architecture built around hinting would be more
robust than having to use separate memory pools for this type of memory
(especially since you are requiring a copy to/from the pool). For
instance, you can mark data DMA'd from disk (perhaps by read-ahead) as
volatile without ever bringing it into the CPU cache. With tmem, if you
wanted to use a tmem pool for all of the page cache, you'd likely
suffer significant overhead due to copying.

Regards,

Anthony Liguori
Dan Magenheimer
2009-Jul-08 23:31 UTC
RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Hi Anthony --

Thanks for the comments.

> I have trouble mapping this to a VMM capable of overcommit
> without just coming back to CMM2.
>
> In CMM2 parlance, ephemeral tmem pools are just normal kernel memory
> marked in the volatile state, no?

They are similar in concept, but a volatile-marked kernel page is still
a kernel page, can be changed by a kernel (or user) store instruction,
and counts as part of the memory used by the VM. An ephemeral tmem page
cannot be directly written by a kernel (or user) store, can only be read
via a "get" (which may or may not succeed), and doesn't count against
the memory used by the VM (even though it likely contains -- for a
while -- data useful to the VM).

> It seems to me that an architecture built around hinting would be more
> robust than having to use separate memory pools for this type of memory
> (especially since you are requiring a copy to/from the pool).

Depends on what you mean by robust, I suppose. Once you understand the
basics of tmem, it is very simple, and this is borne out in the low
invasiveness of the Linux patch. Simplicity is another form of
robustness.

> For instance, you can mark data DMA'd from disk (perhaps by read-ahead)
> as volatile without ever bringing it into the CPU cache. With tmem, if
> you wanted to use a tmem pool for all of the page cache, you'd likely
> suffer significant overhead due to copying.

The copy may be expensive on an older machine, but on newer machines
copying a page is relatively inexpensive. On a reasonable
multi-VM-kernbench-like benchmark I'll be presenting at the Linux
Symposium next week, the overhead is on the order of 0.01% for a fairly
significant savings in IOs.
Anthony Liguori
2009-Jul-08 23:57 UTC
Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> Hi Anthony --
>
> Thanks for the comments.
>
>> I have trouble mapping this to a VMM capable of overcommit
>> without just coming back to CMM2.
>>
>> In CMM2 parlance, ephemeral tmem pools are just normal kernel memory
>> marked in the volatile state, no?
>
> They are similar in concept, but a volatile-marked kernel page
> is still a kernel page, can be changed by a kernel (or user)
> store instruction, and counts as part of the memory used
> by the VM. An ephemeral tmem page cannot be directly written
> by a kernel (or user) store,

Why does tmem require a special store?

A VMM can trap write operations; pages can be stored on disk
transparently by the VMM if necessary. I guess that's the bit I'm
missing.

>> It seems to me that an architecture built around hinting would be
>> more robust than having to use separate memory pools for this type
>> of memory (especially since you are requiring a copy to/from the
>> pool).
>
> Depends on what you mean by robust, I suppose. Once you
> understand the basics of tmem, it is very simple and this
> is borne out in the low invasiveness of the Linux patch.
> Simplicity is another form of robustness.

The main disadvantage I see is that you need to explicitly convert
portions of the kernel to use a data copying API. That seems like an
invasive change to me. Hinting, on the other hand, can be done in a
less invasive way.

I'm not really arguing against tmem, just the need to have explicit
get/put mechanisms for the transcendent memory areas.

> The copy may be expensive on an older machine, but on newer
> machines copying a page is relatively inexpensive.

I don't think that's a true statement at all :-) If you had a workload
where data never came into the CPU cache (zero-copy) and now you
introduce a copy, even with a new system, you're going to see a
significant performance hit.

> On a reasonable
> multi-VM-kernbench-like benchmark I'll be presenting at Linux
> Symposium next week, the overhead is on the order of 0.01%
> for a fairly significant savings in IOs.

But how would something like specweb do, where you should be doing
zero-copy IO from the disk to the network? This is the area where I
would be concerned. For something like kernbench, you're already
bringing the disk data into the CPU cache anyway, so I can appreciate
that the copy could get lost in the noise.

Regards,

Anthony Liguori
Jeremy Fitzhardinge
2009-Jul-09 00:17 UTC
Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/08/09 16:57, Anthony Liguori wrote:

> Why does tmem require a special store?
>
> A VMM can trap write operations; pages can be stored on disk
> transparently by the VMM if necessary. I guess that's the bit I'm
> missing.

tmem doesn't store anything to disk. It's more about making sure that
free host memory can be quickly and efficiently handed out to guests as
they need it; to increase "memory liquidity", as it were. Guests need to
explicitly ask to use tmem, rather than having the host/hypervisor try
to intuit what to do based on access patterns and hints; typically
they'll use tmem as the first-line storage for memory which they were
about to swap out anyway. There's no point in making tmem swappable,
because the guest is perfectly capable of swapping its own memory.

The copying interface avoids a lot of the delicate corners of the CMM
code, in which subtle races can lurk in fairly hard-to-test-for ways.

>> The copy may be expensive on an older machine, but on newer
>> machines copying a page is relatively inexpensive.
>
> I don't think that's a true statement at all :-) If you had a
> workload where data never came into the CPU cache (zero-copy) and now
> you introduce a copy, even with a new system, you're going to see a
> significant performance hit.

If the copy helps avoid physical disk IO, then it is cheap at the price.
A guest generally wouldn't push a page into tmem unless it was about to
evict it anyway, so it has already determined the page is cold/unwanted,
and the copy isn't a great cost. Hot/busy pages shouldn't be anywhere
near tmem; if they are, it suggests you've cut your domain's memory too
aggressively.

    J
Anthony Liguori
2009-Jul-09 00:27 UTC
Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Jeremy Fitzhardinge wrote:

> On 07/08/09 16:57, Anthony Liguori wrote:
>
>> Why does tmem require a special store?
>>
>> A VMM can trap write operations; pages can be stored on disk
>> transparently by the VMM if necessary. I guess that's the bit I'm
>> missing.
>
> tmem doesn't store anything to disk. It's more about making sure that
> free host memory can be quickly and efficiently handed out to guests
> as they need it; to increase "memory liquidity" as it were. Guests need
> to explicitly ask to use tmem, rather than having the host/hypervisor
> try to intuit what to do based on access patterns and hints; typically
> they'll use tmem as the first-line storage for memory which they were
> about to swap out anyway.

If the primary use of tmem is to avoid swapping when memory pressure
would have forced it, how is this different from using ballooning along
with a shrinker callback? With virtio-balloon, a guest can touch any of
the memory it has ballooned to immediately reclaim that memory.

I think the main difference with tmem is that you can also mark a page
as being volatile. The hypervisor can then reclaim that page without
swapping it (it can always reclaim memory and swap it) and generate a
special fault to the guest if it attempts to access it.

You can fail to put with tmem, right? You can also fail to get? In both
cases though, these failures can be handled because Linux is able to
recreate the page on its own (by doing disk IO). So why not just
generate a special fault instead of having to introduce special
accessors?

Regards,

Anthony Liguori
Rik van Riel
2009-Jul-09 01:20 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Anthony Liguori wrote:

> I have trouble mapping this to a VMM capable of overcommit without
> just coming back to CMM2.

Same for me. CMM2 has a more complex mechanism, but way easier policy
than anything else out there.

> In CMM2 parlance, ephemeral tmem pools are just normal kernel memory
> marked in the volatile state, no?

Basically.

> It seems to me that an architecture built around hinting would be more
> robust than having to use separate memory pools for this type of memory
> (especially since you are requiring a copy to/from the pool).

I agree. Something along the lines of CMM2 needs more infrastructure,
but will be infinitely easier to get right from the policy side.

Automatic ballooning is an option too, with fairly simple
infrastructure, but potentially insanely complex policy issues to sort
out...

-- 
All rights reversed.
Dan Magenheimer
2009-Jul-09 21:09 UTC
[Xen-devel] RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
> > I have trouble mapping this to a VMM capable of overcommit
> > without just coming back to CMM2.
>
> Same for me. CMM2 has a more complex mechanism, but way
> easier policy than anything else out there.

Although tmem and CMM2 have similar conceptual objectives, let me try to
describe what I see as a fundamental difference in approach.

The primary objective of both is to utilize RAM more efficiently. Both
are ideally complemented with some longer-term "memory shaping"
mechanism such as automatic ballooning or hotplug.

CMM2's focus is on increasing the number of VMs that can run on top of
the hypervisor. To do this, it depends on hints provided by Linux to
surreptitiously steal memory away from Linux. The stolen memory still
"belongs" to Linux, and if Linux goes to use it but the hypervisor has
already given it to another Linux, the hypervisor must jump through
hoops to give it back. If it guesses wrong and overcommits too
aggressively, the hypervisor must swap some memory to a "hypervisor swap
disk" (which btw has some policy challenges). IMHO this is more of a
"mainframe" model.

Tmem's focus is on helping Linux to aggressively manage the amount of
memory it uses (and thus reduce the amount of memory it would get
"billed" for using). To do this, it provides two "safety valve"
services, one to reduce the cost of "refaults" (Rik's term) and the
other to reduce the cost of swapping. Both services are almost always
available, but if the memory of the physical machine gets overcommitted,
the most aggressive Linux guests must fall back to using their disks
(because the hypervisor does not have a "hypervisor swap disk"). But
when physical memory is undercommitted, it is still being used usefully
without compromising "memory liquidity". (I like this term, Jeremy!)
IMHO this is more of a "cloud" model.

In other words, CMM2, despite its name, is more of a "subservient"
memory management system (Linux is subservient to the hypervisor) and
tmem is more collaborative (Linux and the hypervisor share the
responsibilities and the benefits/costs).

I'm not saying either one is bad or good -- and I'm sure each can be
adapted to approximately deliver the value of the other -- they are just
approaching the same problem from different perspectives.
Rik van Riel
2009-Jul-09 21:27 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> I'm not saying either one is bad or good -- and I'm sure
> each can be adapted to approximately deliver the value
> of the other -- they are just approaching the same problem
> from different perspectives.

Indeed. Tmem and auto-ballooning have a simple mechanism, but the policy
required to make it work right could well be too complex to ever get
right. CMM2 has a more complex mechanism, but the policy is absolutely
trivial. CMM2 and auto-ballooning seem to give roughly similar
performance gains on zSystem.

I suspect that for Xen and KVM, we'll want to choose the approach that
has the simpler policy, because relying on different versions of
different operating systems to all get the policy of auto-ballooning or
tmem right is likely to result in bad interactions between guests and
other intractable issues.

-- 
All rights reversed.
Anthony Liguori
2009-Jul-09 21:41 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> CMM2's focus is on increasing the number of VMs that
> can run on top of the hypervisor. To do this, it
> depends on hints provided by Linux to surreptitiously
> steal memory away from Linux. The stolen memory still
> "belongs" to Linux and if Linux goes to use it but the
> hypervisor has already given it to another Linux, the
> hypervisor must jump through hoops to give it back.

It depends on how you define "jump through hoops".

> If it guesses wrong and overcommits too aggressively,
> the hypervisor must swap some memory to a "hypervisor
> swap disk" (which btw has some policy challenges).
> IMHO this is more of a "mainframe" model.

No, not at all. A guest marks a page as being "volatile", which tells
the hypervisor it never needs to swap that page. It can discard it
whenever it likes.

If the guest later tries to access that page, it will get a special
"discard fault". For a lot of types of memory, the discard fault handler
can then restore that page transparently to the code that generated the
discard fault.

AFAICT, ephemeral tmem has the exact same characteristics as volatile
CMM2 pages. The difference is that tmem introduces an API to explicitly
manage this memory behind a copy interface whereas CMM2 uses hinting and
a special fault handler to allow any piece of memory to be marked in
this way.

> In other words, CMM2, despite its name, is more of a
> "subservient" memory management system (Linux is
> subservient to the hypervisor) and tmem is more
> collaborative (Linux and the hypervisor share the
> responsibilities and the benefits/costs).

I don't really agree with your analysis of CMM2. We can map CMM2
operations directly to ephemeral tmem interfaces, so tmem is a subset of
CMM2, no?

What's appealing to me about CMM2 is that it doesn't change the guest
semantically but rather just gives the VMM more information about how
the guest is using its memory. This suggests that it allows greater
flexibility in the long term to the VMM and, more importantly, provides
an easier implementation across a wide range of guests.

Regards,

Anthony Liguori
Dan Magenheimer
2009-Jul-09 21:48 UTC
[Xen-devel] RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
> > I'm not saying either one is bad or good -- and I'm sure
> > each can be adapted to approximately deliver the value
> > of the other -- they are just approaching the same problem
> > from different perspectives.
>
> Indeed. Tmem and auto-ballooning have a simple mechanism,
> but the policy required to make it work right could well
> be too complex to ever get right.
>
> CMM2 has a more complex mechanism, but the policy is
> absolutely trivial.

Could you elaborate a bit more on what policy you are referring to and
what decisions the policies are trying to guide? And are you looking at
the policies in Linux or in the hypervisor, or the sum of both?

The Linux-side policies in the tmem patch seem trivial to me and the
Xen-side implementation is certainly working correctly, though "working
right" is a hard objective to measure. But depending on how you define
"working right", the pageframe replacement algorithm in Linux may also
be "too complex to ever get right", but it's been working well enough
for a long time.

> CMM2 and auto-ballooning seem to give roughly similar
> performance gains on zSystem.

Tmem provides a huge advantage over my self-ballooning implementation,
but maybe that's because it is more aggressive than the CMM
auto-ballooning, resulting in more refaults that must be "fixed".

> I suspect that for Xen and KVM, we'll want to choose
> the approach that has the simpler policy, because
> relying on different versions of different operating
> systems to all get the policy of auto-ballooning or
> tmem right is likely to result in bad interactions
> between guests and other intractable issues.

Again, not sure what tmem policy in Linux you are referring to or what
bad interactions you foresee. Could you clarify? Auto-ballooning policy
is certainly a challenge, but that's true whether CMM or tmem, right?

Thanks,
Dan
Dan Magenheimer
2009-Jul-09 22:34 UTC
[Xen-devel] RE: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
> > If it guesses wrong and overcommits too aggressively,
> > the hypervisor must swap some memory to a "hypervisor
> > swap disk" (which btw has some policy challenges).
> > IMHO this is more of a "mainframe" model.
>
> No, not at all. A guest marks a page as being "volatile",
> which tells the hypervisor it never needs to swap that page.
> It can discard it whenever it likes.
>
> If the guest later tries to access that page, it will get a special
> "discard fault". For a lot of types of memory, the discard fault
> handler can then restore that page transparently to the code that
> generated the discard fault.

But this means that either the content of that page must have been
preserved somewhere or the discard fault handler has sufficient
information to go back and get the content from the source (e.g. the
filesystem). Or am I misunderstanding?

With tmem, the equivalent of the "failure to access a discarded page" is
inline and synchronous, so if the tmem access "fails", the normal code
immediately executes.

> AFAICT, ephemeral tmem has the exact same characteristics as volatile
> CMM2 pages. The difference is that tmem introduces an API to
> explicitly manage this memory behind a copy interface whereas CMM2
> uses hinting and a special fault handler to allow any piece of memory
> to be marked in this way.
>  :
> I don't really agree with your analysis of CMM2. We can map CMM2
> operations directly to ephemeral tmem interfaces so tmem is a
> subset of CMM2, no?

Not really. I suppose one *could* use tmem that way, immediately writing
every page read from disk into tmem, though that would probably cause
some real coherency challenges. But the patch as proposed only puts
ready-to-be-replaced pages (as determined by Linux's PFRA) into
ephemeral tmem.

The two services provided to Linux (in the proposed patch) by tmem are:

1) "I have a page of memory that I'm about to throw away because I'm
   not sure I need it any more and I have a better use for that
   pageframe right now. Mr Tmem, might you have someplace you can
   squirrel it away for me in case I need it again? Oh, and by the way,
   if you can't, or you lose it, no big deal, as I can go get it from
   disk if I need to."

2) "I'm out of memory and have to put this page somewhere. Mr Tmem, can
   you take it? But if you do take it, you have to promise to give it
   back when I ask for it! If you can't promise, never mind, I'll find
   something else to do with it."

> > In other words, CMM2, despite its name, is more of a
> > "subservient" memory management system (Linux is
> > subservient to the hypervisor) and tmem is more
> > collaborative (Linux and the hypervisor share the
> > responsibilities and the benefits/costs).
>
> What's appealing to me about CMM2 is that it doesn't change the guest
> semantically but rather just gives the VMM more information about how
> the guest is using its memory. This suggests that it allows greater
> flexibility in the long term to the VMM and more importantly,
> provides an easier implementation across a wide range of guests.

I suppose changing Linux to utilize the two tmem services as described
above is a semantic change. But to me it seems no more of a semantic
change than requiring a new special page fault handler because a page of
memory might disappear behind the OS's back.

But IMHO this is a corollary of the fundamental difference. CMM2's is
more the "VMware" approach, which is that OSes should never have to be
modified to run in a virtual environment.
(Oh, but maybe modified just slightly to make the hypervisor a little
less clueless about the OS's resource utilization.)

Tmem asks: if an OS is going to often run in a virtualized environment,
what can be done to share the responsibility for resource management so
that the OS does what it can with the knowledge that it has, and the
hypervisor can most flexibly manage resources across all the guests?

I do agree that adding an additional API binds the user and provider of
the API less flexibly than without the API, but as long as the API is
optional (as it is for both tmem and CMM2), I don't see why CMM2
provides more flexibility.

Thanks,
Dan
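The two preswap "services" Dan describes above map, roughly, to the
swap-path flow sketched below. The helper names, the in-preswap tracking
map, and the 0-means-success convention are illustrative assumptions,
not the exact hooks in the preswap patch.

/*
 * Hedged sketch of the preswap flow ("service 2"): offer the page to a
 * persistent tmem pool first and fall back to the swap device if tmem
 * refuses.  Whether a page went to preswap is recorded kernel-side so a
 * later fetch knows where to look.  All names are illustrative.
 */
struct swp_entry;	/* opaque swap slot identifier for this sketch */

int preswap_put(struct swp_entry *ent, unsigned long pfn);	/* may refuse */
int preswap_get(struct swp_entry *ent, unsigned long pfn);	/* succeeds once put */
void mark_in_preswap(struct swp_entry *ent);
void clear_in_preswap(struct swp_entry *ent);
int page_in_preswap(struct swp_entry *ent);
int swap_device_write(struct swp_entry *ent, unsigned long pfn);
int swap_device_read(struct swp_entry *ent, unsigned long pfn);

static int swap_out_page(struct swp_entry *ent, unsigned long pfn)
{
	if (preswap_put(ent, pfn) == 0) {
		mark_in_preswap(ent);		/* disk write avoided */
		return 0;
	}
	return swap_device_write(ent, pfn);	/* tmem said no: use the disk */
}

static int swap_in_page(struct swp_entry *ent, unsigned long pfn)
{
	if (page_in_preswap(ent)) {
		int ret = preswap_get(ent, pfn);  /* persistent pool: expected to succeed */
		clear_in_preswap(ent);		  /* get from a private pool is destructive */
		return ret;
	}
	return swap_device_read(ent, pfn);
}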
Rik van Riel
2009-Jul-09 22:45 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem). Or am I misunderstanding?

The latter. Only pages which can be fetched from source again are marked
as volatile.

> But IMHO this is a corollary of the fundamental difference. CMM2's
> is more the "VMware" approach, which is that OSes should never have
> to be modified to run in a virtual environment.

Actually, the CMM2 mechanism is quite invasive in the guest operating
system's kernel.

> I don't see why CMM2 provides more flexibility.

I don't think anyone is arguing that. One thing that people have argued
is that CMM2 can be more efficient, and easier to get the policy right
in the face of multiple guest operating systems.

-- 
All rights reversed.
Anthony Liguori
2009-Jul-09 23:33 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> But this means that either the content of that page must have been
> preserved somewhere or the discard fault handler has sufficient
> information to go back and get the content from the source (e.g.
> the filesystem). Or am I misunderstanding?

As Rik said, it's the latter.

> With tmem, the equivalent of the "failure to access a discarded page"
> is inline and synchronous, so if the tmem access "fails", the
> normal code immediately executes.

Yup. This is the main difference AFAICT. It's really just API semantics
within Linux. You could clearly use the volatile state of CMM2 to
implement tmem as an API in Linux. The get/put functions would set a
flag such that if the discard handler was invoked while that operation
was in progress, the operation could safely fail. That's why I claimed
tmem is a subset of CMM2.

> I suppose changing Linux to utilize the two tmem services
> as described above is a semantic change. But to me it
> seems no more of a semantic change than requiring a new
> special page fault handler because a page of memory might
> disappear behind the OS's back.
>
> But IMHO this is a corollary of the fundamental difference. CMM2's
> is more the "VMware" approach, which is that OSes should never have
> to be modified to run in a virtual environment. (Oh, but maybe
> modified just slightly to make the hypervisor a little less
> clueless about the OS's resource utilization.)

While I always enjoy a good holy war, I'd like to avoid one here because
I want to stay on the topic at hand.

If there was one change to tmem that would make it more palatable, for
me it would be changing the way pools are "allocated". Instead of
getting an opaque handle from the hypervisor, I would force the guest to
allocate its own memory and to tell the hypervisor that it's a tmem
pool. You could then introduce semantics about whether the guest was
allowed to directly manipulate the memory as long as it was in the pool.
It would be required to access the memory via get/put functions that,
under Xen, would end up being a hypercall and a copy.

Presumably you would do some tricks with ballooning to allocate empty
memory in Xen and then use those addresses as tmem pools. On KVM, we
could do something more clever.

The big advantage of keeping the tmem pool part of the normal set of
guest memory is that you don't introduce new challenges with respect to
memory accounting. Whether or not tmem is directly accessible from the
guest, it is another memory resource. I'm certain that you'll want to do
accounting of how much tmem is being consumed by each guest, and I
strongly suspect that you'll want to do tmem accounting on a per-process
basis. I also suspect that doing tmem limiting for things like cgroups
would be desirable. That all points to making tmem normal memory so that
all that infrastructure can be reused.

I'm not sure how well this maps to Xen guests, but it works out fine
when the VMM is capable of presenting memory to the guest without
actually allocating it (via overcommit).

Regards,

Anthony Liguori
Avi Kivity
2009-Jul-12 09:20 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/10/2009 06:23 PM, Dan Magenheimer wrote:

>> If there was one change to tmem that would make it more palatable,
>> for me it would be changing the way pools are "allocated". Instead
>> of getting an opaque handle from the hypervisor, I would force the
>> guest to allocate its own memory and to tell the hypervisor that
>> it's a tmem pool.
>
> An interesting idea, but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS. As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM. The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest. And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically, so
> the guest would have to be prepared to handle this. I also wonder
> if this would make shared-tmem-pools more difficult.

Having no struct pages is also a downside; for example this guest cannot
have more than 1GB of anonymous memory without swapping like mad.
Swapping to tmem is fast but still a lot slower than having the memory
available.

tmem makes life a lot easier for the hypervisor and for the guest, but
also gives up a lot of flexibility. There's a difference between memory
and a very fast synchronous backing store.

> I can see how it might be useful for KVM though. Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.

My worry is that tmem for kvm leaves a lot of niftiness on the table,
since it was designed for a hypervisor with much simpler memory
management. kvm can already use spare memory for backing guest swap, and
can already convert unused guest memory to free memory (by swapping it).
tmem doesn't really integrate well with these capabilities.

-- 
error compiling committee.c: too many arguments to function
Anthony Liguori
2009-Jul-12 13:28 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Dan Magenheimer wrote:

> Oops, sorry, I guess that was a bit inflammatory. What I meant to
> say is that inferring resource utilization efficiency is a very
> hard problem and VMware (and I'm sure IBM too) has done a fine job
> with it; CMM2 explicitly provides some very useful information from
> within the OS to the hypervisor so that it doesn't have to infer
> that information; but tmem is trying to go a step further by making
> the cooperation between the OS and hypervisor more explicit
> and directly beneficial to the OS.

KVM definitely falls into the camp of trying to minimize modification to
the guest.

>> If there was one change to tmem that would make it more palatable,
>> for me it would be changing the way pools are "allocated". Instead
>> of getting an opaque handle from the hypervisor, I would force the
>> guest to allocate its own memory and to tell the hypervisor that
>> it's a tmem pool.
>
> An interesting idea, but one of the nice advantages of tmem being
> completely external to the OS is that the tmem pool may be much
> larger than the total memory available to the OS. As an extreme
> example, assume you have one 1GB guest on a physical machine that
> has 64GB physical RAM. The guest now has 1GB of directly-addressable
> memory and 63GB of indirectly-addressable memory through tmem.
> That 63GB requires no page structs or other data structures in the
> guest. And in the current (external) implementation, the size
> of each pool is constantly changing, sometimes dramatically, so
> the guest would have to be prepared to handle this. I also wonder
> if this would make shared-tmem-pools more difficult.
>
> I can see how it might be useful for KVM though. Once the
> core API and all the hooks are in place, a KVM implementation of
> tmem could attempt something like this.

It's the core API that is really the issue. The semantics of tmem
(external memory pool with copy interface) are really what is
problematic. The basic concept, notifying the VMM about memory that can
be recreated by the guest to avoid the VMM having to swap before
reclaim, is great and I'd love to see Linux support it in some way.

>> The big advantage of keeping the tmem pool part of the normal set of
>> guest memory is that you don't introduce new challenges with
>> respect to memory accounting. Whether or not tmem is directly
>> accessible from the guest, it is another memory resource. I'm
>> certain that you'll want to do accounting of how much tmem is being
>> consumed by each guest
>
> Yes, the Xen implementation of tmem does accounting on a per-pool
> and a per-guest basis and exposes the data via a privileged
> "tmem control" hypercall.

I was talking about accounting within the guest. It's not just a matter
of accounting within the mm, it's also about accounting in userspace. A
lot of software out there depends on getting detailed statistics from
Linux about how much memory is in use in order to determine things like
memory pressure. If you introduce a new class of memory, you need a new
class of statistics to expose to userspace, and all those tools need
updating.

Regards,

Anthony Liguori
Avi Kivity
2009-Jul-12 17:16 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/12/2009 07:20 PM, Dan Magenheimer wrote:

>>> that information; but tmem is trying to go a step further by making
>>> the cooperation between the OS and hypervisor more explicit
>>> and directly beneficial to the OS.
>>
>> KVM definitely falls into the camp of trying to minimize
>> modification to the guest.
>
> No argument there. Well, maybe one :-) Yes, but KVM
> also heavily encourages unmodified guests. Tmem is
> philosophically in favor of finding a balance between
> things that work well with no changes to any OS (and
> thus work just fine regardless of whether the OS is
> running in a virtual environment or not), and things
> that could work better if the OS is knowledgeable that
> it is running in a virtual environment.

CMM2 and tmem are not any different in this regard; both require OS
modification, and both make information available to the hypervisor. In
fact CMM2 is much more intrusive (but on the other hand provides much
more information).

> For those that believe virtualization is a flash-in-
> the-pan, no modifications to the OS is the right answer.
> For those that believe it will be pervasive in the
> future, finding the right balance is a critical step
> in operating system evolution.

You're arguing for CMM2 here IMO.

> Is it the tmem API or the precache/preswap API layered on
> top of it that is problematic? Both currently assume copying
> but perhaps the precache/preswap API could, with minor
> modifications, meet KVM's needs better?

My take on this is that precache (predecache?) / preswap can be
implemented even without tmem by using write-through backing for the
virtual disk. For swap this is actually slightly more efficient than
tmem preswap, for precache slightly less efficient (since there will be
some double caching). So I'm more interested in other use cases of
tmem/CMM2.

> Well, first, tmem's very name means memory that is "beyond the
> range of normal perception". This is certainly not the first class
> of memory in use in data centers that can't be accounted at
> process granularity. I'm thinking disk array caches as the
> primary example. Also lots of tools that work great in a
> non-virtualized OS are worthless or misleading in a virtual
> environment.

Right, the transient uses of tmem when applied to disk objects
(swap/pagecache) are very similar to disk caches. Which is why you can
get a very similar effect when caching your virtual disks; this can be
done without any guest modification.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Avi Kivity
2009-Jul-12 17:27 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/12/2009 07:28 PM, Dan Magenheimer wrote:

>> Having no struct pages is also a downside; for example this guest
>> cannot have more than 1GB of anonymous memory without swapping like
>> mad. Swapping to tmem is fast but still a lot slower than having the
>> memory available.
>
> Yes, true. Tmem offers little additional advantage for workloads
> that have a huge variation in working set size that is primarily
> anonymous memory. That larger scale "memory shaping" is left to
> ballooning and hotplug.

And this is where the policy problems erupt. When do you balloon in
favor of tmem? Which guest do you balloon? Do you leave it to the
administrator? There's the host's administrator and the guests'
administrators.

CMM2 solves this neatly by providing information to the host. The host
can pick the least recently used page (or a better algorithm) and evict
it using information from the guest, either dropping it or swapping it.
It also provides information back to the guest when it references an
evicted page: either the guest needs to recreate the page or it just
needs to wait.

>> tmem makes life a lot easier for the hypervisor and for the guest,
>> but also gives up a lot of flexibility. There's a difference between
>> memory and a very fast synchronous backing store.
>
> I don't see that it gives up that flexibility. System administrators
> are still free to size their guests properly. Tmem's contribution
> is in environments that are highly dynamic, where the only
> alternative is really sizing memory maximally (and thus wasting
> it for the vast majority of time in which the working set is smaller).

I meant that once a page is converted to tmem, there's a limited amount
of things you can do with it compared to normal memory. For example tmem
won't help with a dcache-intensive workload.

> I'm certainly open to identifying compromises and layer modifications
> that help meet the needs of both Xen and KVM (and others). For
> example, if we can determine that the basic hook placement for
> precache/preswap (or even just precache for KVM) can be built
> on different underlying layers, that would be great!

I'm not sure preswap/precache by itself justifies tmem, since it can be
emulated by backing the disk with a cached file. What I'm missing in
tmem is the ability for the hypervisor to take a global view on memory;
instead it's forced to look at memory and tmem separately. That's fine
for Xen, since it can't really make any decisions on normal memory
(lacking swap); on the other hand kvm doesn't map well to tmem, since
"free memory" is already used by the host pagecache.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Anthony Liguori
2009-Jul-12 19:34 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Avi Kivity wrote:

> In fact CMM2 is much more intrusive (but on the other hand provides
> much more information).

I don't think this will remain true long term. CMM2 touches a lot of
core mm code and certainly qualifies as intrusive. However, the result
is that the VMM has a tremendous amount of insight into how the guest is
using its memory and can implement all sorts of fancy policy for
reclaim. Since the reclaim policy can evolve without any additional
assistance from the guest, the guest doesn't have to change as policy
evolves.

Since tmem requires that reclaim policy is implemented within the guest,
I think in the long term, tmem will have to touch a broad number of
places within Linux. Beside the core mm, the first round of patches
already touches filesystems (just ext3 to start out with). To truly be
effective, tmem would have to be a first-class kernel citizen and I
suspect a lot of code would have to be aware of it.

So while CMM2 touches a lot of code no one wants to touch, I think in
the long term it would remain relatively well contained compared to
tmem, which will steadily increase in complexity within the guest.

Regards,

Anthony Liguori
Dan Magenheimer
2009-Jul-12 20:39 UTC
RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
> CMM2 and tmem are not any different in this regard; both require OS
> modification, and both make information available to the hypervisor.
> In fact CMM2 is much more intrusive (but on the other hand provides
> much more information).
>
> > For those that believe it will be pervasive in the
> > future, finding the right balance is a critical step
> > in operating system evolution.
>
> You're arguing for CMM2 here IMO.

I'm arguing that both are a good thing and a step in the right
direction. In some ways, tmem is a bigger step and in some ways CMM2 is
a bigger step.

> My take on this is that precache (predecache?) / preswap can be
> implemented even without tmem by using write-through backing for the
> virtual disk. For swap this is actually slightly more efficient than
> tmem preswap, for precache slightly less efficient (since there will
> be some double caching). So I'm more interested in other use cases
> of tmem/CMM2.
>
> Right, the transient uses of tmem when applied to disk objects
> (swap/pagecache) are very similar to disk caches. Which is why you
> can get a very similar effect when caching your virtual disks; this
> can be done without any guest modification.

Write-through backing and virtual disk caching offer a similar effect,
but it is far from the same.
Avi Kivity
2009-Jul-12 20:43 UTC
Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/12/2009 11:39 PM, Dan Magenheimer wrote:

>> Right, the transient uses of tmem when applied to disk objects
>> (swap/pagecache) are very similar to disk caches. Which is why you
>> can get a very similar effect when caching your virtual disks; this
>> can be done without any guest modification.
>
> Write-through backing and virtual disk caching offer a
> similar effect, but it is far from the same.

Can you explain how it differs for the swap case? Maybe I don't
understand how tmem preswap works.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Dan Magenheimer
2009-Jul-12 21:08 UTC
RE: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
>>> Right, the transient uses of tmem when applied to disk objects
>>> (swap/pagecache) are very similar to disk caches.  Which is why you
>>> can get a very similar effect when caching your virtual disks; this
>>> can be done without any guest modification.
>>
>> Write-through backing and virtual disk caching offer a similar
>> effect, but it is far from the same.
>
> Can you explain how it differs for the swap case?  Maybe I don't
> understand how tmem preswap works.

The key differences I see are the "please may I store something" API
and the fact that the reply (yes or no) can vary across time depending
on the state of the collective of guests.  Virtual disk caching
requires the host to always say yes and always deliver persistence.

I can see that this is less of a concern for KVM because the host can
swap... though doesn't this hide information from the guest and
potentially have split-brain swapping issues?

(thanks for the great discussion so far... going offline mostly now
for a few days)

Dan
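(To make the "please may I store something" semantics concrete, here is
a minimal guest-side sketch of the swapout path in C.  The helpers
preswap_put() and swap_writepage_to_disk() are hypothetical names for
illustration only, and the sketch assumes that a successful put means
the page is held persistently so the disk write can be skipped; it is
not the interface from the patch series.)

/* Illustrative sketch only; names and signatures are assumptions. */
static int swap_out_page(struct page *page, swp_entry_t entry)
{
	/*
	 * Ask the hypervisor "please may I store something?"  The
	 * answer may be no, and may change from one call to the next
	 * depending on the state of the collective of guests.
	 */
	if (preswap_put(page) == 0)
		return 0;	/* accepted: a later get will find it */

	/* Refused: fall back to the ordinary swap device write. */
	return swap_writepage_to_disk(page, entry);
}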
Avi Kivity
2009-Jul-13 11:33 UTC
Re: [Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/13/2009 12:08 AM, Dan Magenheimer wrote:
>> Can you explain how it differs for the swap case?  Maybe I don't
>> understand how tmem preswap works.
>
> The key differences I see are the "please may I store something" API
> and the fact that the reply (yes or no) can vary across time
> depending on the state of the collective of guests.  Virtual disk
> caching requires the host to always say yes and always deliver
> persistence.

We need to compare tmem+swap to swap+cache, not just tmem to cache.
Here's how I see it:

tmem+swap swapout:
- guest copies page to tmem (may fail)
- guest writes page to disk

cached drive swapout:
- guest writes page to disk
- host copies page to cache

tmem+swap swapin:
- guest reads page from tmem (may fail)
- on tmem failure, guest reads swap from disk
- guest drops tmem page

cached drive swapin:
- guest reads page from disk
- host may satisfy read from cache

tmem+swap ageing:
- host may drop tmem page at any time

cached drive ageing:
- host may drop cached page at any time

So they're pretty similar.  The main difference is that tmem can drop
the page on swapin.  It could be made to work with swap by supporting
the TRIM command.

> I can see that this is less of a concern for KVM because the host can
> swap... though doesn't this hide information from the guest and
> potentially have split-brain swapping issues?

Double swap is bad for performance, yes.  CMM2 addresses it nicely.
tmem doesn't address it at all - it assumes you have excess memory.

> (thanks for the great discussion so far... going offline mostly now
> for a few days)

I'm going offline too so it cancels out.

-- 
error compiling committee.c: too many arguments to function
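(A complementary guest-side sketch of the tmem+swap swapin flow listed
above, again in C.  The helpers preswap_get(), preswap_flush() and
swap_readpage_from_disk() are hypothetical names used only to
illustrate the flow, not the real interface.)

/* Illustrative sketch only; names and signatures are assumptions. */
static int swap_in_page(struct page *page, swp_entry_t entry)
{
	if (preswap_get(page) == 0) {
		/*
		 * Hit: the hypervisor still held the page.  Drop the
		 * preswap copy now that the page is back in guest
		 * memory (the analogue of TRIM for a cached virtual
		 * disk).
		 */
		preswap_flush(entry);
		return 0;
	}

	/* Miss (the page was never accepted into preswap): read the
	 * swap device as usual. */
	return swap_readpage_from_disk(page, entry);
}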
Chris Mason
2009-Jul-13 20:17 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On Sun, Jul 12, 2009 at 02:34:25PM -0500, Anthony Liguori wrote:
> Avi Kivity wrote:
>> In fact CMM2 is much more intrusive (but on the other hand provides
>> much more information).
>
> I don't think this will remain true long term.  CMM2 touches a lot of
> core mm code and certainly qualifies as intrusive.  However, the
> result is that the VMM has a tremendous amount of insight into how
> the guest is using its memory and can implement all sorts of fancy
> policy for reclaim.  Since the reclaim policy can evolve without any
> additional assistance from the guest, the guest doesn't have to
> change as policy evolves.
>
> Since tmem requires that reclaim policy is implemented within the
> guest, I think in the long term tmem will have to touch a broad
> number of places within Linux.  Besides the core mm, the first round
> of patches already touches filesystems (just ext3 to start out with).
> To truly be effective, tmem would have to be a first-class kernel
> citizen, and I suspect a lot of code would have to be aware of it.

This depends on the extent to which tmem is integrated into the VM.
For filesystem usage, the hooks are relatively simple because we
already have a lot of code sharing in this area.  Basically tmem is
concerned with when we free a clean page and when the contents of a
particular offset in the file are no longer valid.

The nice part about tmem is that any time a given corner case gets
tricky, you can just invalidate that offset in tmem and move on.

-chris
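(A minimal sketch of the two hook points Chris describes: a put when a
clean page-cache page is freed, and an invalidation when the contents
of a file offset go stale.  The precache_put() and precache_flush()
names and signatures are assumptions for illustration, not the exact
interface from the patches.)

/* Illustrative sketch only; names and signatures are assumptions. */

/* When the PFRA frees a clean page-cache page, offer it to tmem on
 * the way out.  The hypervisor may refuse; either way the page is
 * removed from the page cache as usual. */
static void evict_clean_page(struct address_space *mapping,
			     struct page *page)
{
	precache_put(mapping, page->index, page);
}

/* When the contents of a file offset become invalid (truncate,
 * overwrite), flush that offset from tmem so a later get can never
 * return stale data; whenever a corner case gets tricky, invalidating
 * the offset is always safe. */
static void invalidate_file_offset(struct address_space *mapping,
				   pgoff_t index)
{
	precache_flush(mapping, index);
}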
Anthony Liguori
2009-Jul-13 20:38 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Chris Mason wrote:
> This depends on the extent to which tmem is integrated into the VM.
> For filesystem usage, the hooks are relatively simple because we
> already have a lot of code sharing in this area.  Basically tmem is
> concerned with when we free a clean page and when the contents of a
> particular offset in the file are no longer valid.

But filesystem usage is perhaps the least interesting part of tmem.

The VMM already knows which pages in the guest are the result of disk
IO (it's the one that put it there, after all).  It also knows when
those pages have been invalidated (or it can tell based on
write-faulting).

The VMM also knows when the disk IO has been re-requested by tracking
previous requests.  It can keep the old IO requests cached in memory
and use that to satisfy re-reads as long as the memory isn't needed
for something else.  Basically, we have tmem today with kvm and we use
it by default by using the host page cache to do I/O caching (via
cache=writethrough).

The difference with our "tmem" is that instead of providing an
interface where the guest explicitly says, "I'm throwing away this
memory, I may need it later", and then asking again for it, the guest
throws away the page and then we can later satisfy the disk I/O
request that results from re-requesting the page instantaneously.

This transparent approach is far superior too because it enables
transparent sharing across multiple guests.  This works well for CoW
images and would work really well if we had a file system capable of
block-level deduplication... :-)

Regards,

Anthony Liguori
Chris Mason
2009-Jul-13 21:01 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
> Chris Mason wrote:
>> This depends on the extent to which tmem is integrated into the VM.
>> For filesystem usage, the hooks are relatively simple because we
>> already have a lot of code sharing in this area.  Basically tmem is
>> concerned with when we free a clean page and when the contents of a
>> particular offset in the file are no longer valid.
>
> But filesystem usage is perhaps the least interesting part of tmem.
>
> The VMM already knows which pages in the guest are the result of disk
> IO (it's the one that put it there, after all).  It also knows when
> those pages have been invalidated (or it can tell based on
> write-faulting).
>
> The VMM also knows when the disk IO has been re-requested by tracking
> previous requests.  It can keep the old IO requests cached in memory
> and use that to satisfy re-reads as long as the memory isn't needed
> for something else.  Basically, we have tmem today with kvm and we
> use it by default by using the host page cache to do I/O caching
> (via cache=writethrough).

I'll definitely grant that caching with writethrough adds more caching,
but it does need trim support before it is similar to tmem.  The
caching is transparent to the guest, but it is also transparent to
qemu, and so it is harder to manage and size (or even get a stat for
how big it currently is).

> The difference with our "tmem" is that instead of providing an
> interface where the guest explicitly says, "I'm throwing away this
> memory, I may need it later", and then asking again for it, the guest
> throws away the page and then we can later satisfy the disk I/O
> request that results from re-requesting the page instantaneously.
>
> This transparent approach is far superior too because it enables
> transparent sharing across multiple guests.  This works well for CoW
> images and would work really well if we had a file system capable of
> block-level deduplication... :-)

Grin, I'm afraid that even if someone were to jump in and write the
perfect cow based filesystem and then find a willing contributor to
code up a dedup implementation, each cow image would be a different
file and so it would have its own address space.

Dedup and COW are an easy way to have hints about which pages are
supposed to have the same contents, but they would have to go with
some other duplicate page sharing scheme.

-chris
Anthony Liguori
2009-Jul-13 21:17 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
Chris Mason wrote:
> On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
>
> I'll definitely grant that caching with writethrough adds more
> caching, but it does need trim support before it is similar to tmem.

I think trim is somewhat orthogonal, but even if you do need it, the
nice thing about implementing ATA trim support versus a
paravirtualization is that it works with a wide variety of guests.

From the perspective of the VMM, it seems like a good thing.

> The caching is transparent to the guest, but it is also transparent
> to qemu, and so it is harder to manage and size (or even get a stat
> for how big it currently is).

That's certainly a fixable problem though.  We could expose statistics
to userspace and then further expose those to guests.  I think the
first question to answer though is what you would use those statistics
for.

>> The difference with our "tmem" is that instead of providing an
>> interface where the guest explicitly says, "I'm throwing away this
>> memory, I may need it later", and then asking again for it, the
>> guest throws away the page and then we can later satisfy the disk
>> I/O request that results from re-requesting the page
>> instantaneously.
>>
>> This transparent approach is far superior too because it enables
>> transparent sharing across multiple guests.  This works well for CoW
>> images and would work really well if we had a file system capable of
>> block-level deduplication... :-)
>
> Grin, I'm afraid that even if someone were to jump in and write the
> perfect cow based filesystem and then find a willing contributor to
> code up a dedup implementation, each cow image would be a different
> file and so it would have its own address space.
>
> Dedup and COW are an easy way to have hints about which pages are
> supposed to have the same contents, but they would have to go with
> some other duplicate page sharing scheme.

Yes.  We have the information we need to dedup this memory though.  We
just need a way to track non-dirty pages that result from DMA, map the
host page cache directly into the guest, and then CoW when the guest
tries to dirty that memory.

We don't quite have the right infrastructure in Linux yet to do this
effectively, but this is entirely an issue with the host.  The guest
doesn't need any changes here.

Regards,

Anthony Liguori
Avi Kivity
2009-Jul-26 15:00 UTC
[Xen-devel] Re: [RFC PATCH 0/4] (Take 2): transcendent memory ("tmem") for Linux
On 07/14/2009 12:17 AM, Anthony Liguori wrote:
> Chris Mason wrote:
>> On Mon, Jul 13, 2009 at 03:38:45PM -0500, Anthony Liguori wrote:
>> I'll definitely grant that caching with writethrough adds more
>> caching, but it does need trim support before it is similar to tmem.
>
> I think trim is somewhat orthogonal, but even if you do need it, the
> nice thing about implementing ATA trim support versus a
> paravirtualization is that it works with a wide variety of guests.
>
> From the perspective of the VMM, it seems like a good thing.

trim is also lovely in that images will no longer grow monotonically
even though guest disk usage is constant or is even reduced.

-- 
error compiling committee.c: too many arguments to function