Dan Magenheimer
2009-Jun-09 22:17 UTC
[Xen-devel] [PATCH] [linux-2.6.18-xen] transcendent memory ("tmem") linux-side changes
Transcendent memory ("tmem") for Linux Tmem, when called from a tmem-capable (paravirtualized) guest, makes use of otherwise unutilized ("fallow") memory to create and manage pools of pages that can be accessed from the guest either as "ephemeral" pages or as "persistent" pages. In either case, the pages are not directly addressible by the guest, only copied to and fro via the tmem interface. Ephemeral pages are a nice place for a guest to put recently evicted clean pages that it might need again; these pages can be reclaimed synchronously by Xen for other guests or other uses. Persistent pages are a nice place for a guest to put "swap" pages to avoid sending them to disk. These pages retain data as long as the guest lives, but count against the guest memory allocation. This patch contains the Linux paravirtualization changes to complement the tmem Xen patch (xen-unstable c/s 19646). It implements "precache" (ext3 only as of now), "preswap", and limited "shared precache" (ocfs2 only as of now) support. CONFIG options are required to turn on the support (but in this patch they default to "y"). If the underlying Xen does not have tmem support or has it turned off, this is sensed early to avoid nearly all hypercalls. Lots of useful prose about tmem can be found at http://oss.oracle.com/projects/tmem Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com> _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Jun-10 07:58 UTC
Re: [Xen-devel] [PATCH] [linux-2.6.18-xen] transcendent memory ("tmem")linux-side changes
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 10.06.09 00:17 >>>A few general remarks:>--- a/include/linux/swap.h Mon Jun 08 12:23:24 2009 +0100 >+++ b/include/linux/swap.h Tue Jun 09 15:37:03 2009 -0600 >@@ -6,6 +6,7 @@ > #include <linux/mmzone.h> > #include <linux/list.h> > #include <linux/sched.h> >+#include <linux/vmalloc.h> > > #include <asm/atomic.h> > #include <asm/page.h> >@@ -143,8 +144,69 @@ struct swap_info_struct { > unsigned int pages; > unsigned int max; > unsigned int inuse_pages; >+ unsigned long *preswap_map; >+ unsigned int preswap_pages;Missing #ifdef CONFIG_PRESWAP?> int next; /* next entry on swap list */ > }; >+ >+#ifdef CONFIG_PRESWAP >+ >+#include <linux/sysctl.h> >+extern int preswap_sysctl_handler(struct ctl_table *, int, struct file *, >+ void __user *, size_t *, loff_t *); >+extern const unsigned long preswap_zero, preswap_infinity; >+ >+extern void preswap_shrink(unsigned long); >+extern int preswap_test(struct swap_info_struct *, unsigned long); >+extern void preswap_init(unsigned); >+extern int preswap_put(struct page *); >+extern int preswap_get(struct page *); >+extern void preswap_flush(unsigned, unsigned long); >+extern void preswap_flush_area(unsigned); >+/* in swapfile.c */ >+extern int try_to_unuse(unsigned int, unsigned int, unsigned long); >+ >+static inline void *preswap_malloc(unsigned long size) >+{ >+ return vmalloc(size);I would view this as problematic on 32-bit, as the vmalloc space is rather restricted there. Not because the allocation here may fail, but because it may cause failures elsewhere. Some feedback on the amount of vmalloc space used and/or still available is going to be needed here I''d think.>... >@@ -247,7 +309,7 @@ extern int can_share_swap_page(struct pa > extern int can_share_swap_page(struct page *); > extern int remove_exclusive_swap_page(struct page *); > struct backing_dev_info; >- >+extern struct swap_list_t swap_list; > extern spinlock_t swap_lock; > > /* linux/mm/thrash.c */That''s not good - swap_list being non-static in 2.6.18 is just an oversight afaict, which is being fixed in later kernels. I think it would be cleaner (also wrt. a possible pv-ops implementation) to move some or all of the code in mm/preswap.c into mm/swapfile.c (would likewise eliminate the need to remove the ''static'' from try_to_unuse()).>--- a/mm/Kconfig Mon Jun 08 12:23:24 2009 +0100 >+++ b/mm/Kconfig Tue Jun 09 15:37:03 2009 -0600 >@@ -152,3 +152,34 @@ config RESOURCES_64BIT > default 64BIT > help > This option allows memory and IO resources to be 64 bit. >+ >+# >+# support for transcendent memory >+# >+config TMEM >+ bool "Transcendent memory support" >+ def_bool y >+ depends on XEN >+ help >+ In a virtualized environment, allows unused and underutilized >+ system physical memory to be made accessible through a narrow >+ well-defined page-copy-based API. If unsure, say Y. >+ >+config PRECACHE >+ bool "Cache clean pages in transcendent memory" >+ def_bool y >+ depends on TMEM >+ help >+ Allows the transcendent memory pool to be used to store clean >+ page-cache pages which, under some circumstances, will greatly >+ reduce paging and thus improve performance. If unsure, say Y. >+ >+config PRESWAP >+ bool "Swap pages to transcendent memory" >+ def_bool y >+ depends on TMEM >+ help >+ Allows the transcendent memory pool to be used as a pseudo-swap >+ device which, under some circumstances, will greatly reduce >+ swapping and thus improve performance. If unsure, say Y. >+''bool'' followed by ''def_bool''? And even more, an experimental (as I would view it at this stage) feature that defaults to ''yes''? Also, does have selecting TMEM on its own any meaning? If not (which I think is the case), then ''select''ing TMEM from the other two options may be more appropriate (while TMEM then shouldn''t have a prompt anymore).>--- /dev/null Thu Jan 01 00:00:00 1970 +0000 >+++ b/mm/tmem.c Tue Jun 09 15:37:03 2009 -0600 >@@ -0,0 +1,41 @@ >+/* >+ * Xen implementation for transcendent memory (tmem) >+ * >+ * Dan Magenheimer <dan.magenheimer@oracle.com> 2009 >+ */ >+ >+#include <linux/types.h> >+#include <xen/interface/xen.h> >+#include <asm/hypervisor.h>This is Xen-only code sitting in an apparently generic source file. Wouldn''t that better be placed in drivers/xen/core/ if it''s architecture neutral? Also, it doesn''t include its own header (but see below for comments on that one)...>--- /dev/null Thu Jan 01 00:00:00 1970 +0000 >+++ b/mm/tmem.h Tue Jun 09 15:37:03 2009 -0600 >@@ -0,0 +1,111 @@ >+/* >+ * linux/mm/tmem.h >+ * >+ * Interface to transcendent memory, used by mm/precache.c and mm/preswap.c >+ * Currently implemented on XEN, but may be implemented elsewhere in future. >+ * >+ * Copyright (C) 2008,2009 Dan Magenheimer, Oracle Corp. >+ */ >+ >+ >+#define TMEM_CONTROL 0 >+#define TMEM_NEW_POOL 1 >+#define TMEM_DESTROY_POOL 2 >...While I partly understand your intention of potentially sharing the implementation with other hypervisors, I don''t think Xen interface definitions belong here - you''re not using include/xen/interface/tmem.h at all, while that''s where the definitions should come from imo. If you really want this to represent an abstract interface, then ensuring consistency with the Xen interface would be a minimal requirement (by e.g. including the Xen header in the #ifdef CONFIG_XEN block). Otoh I''m getting the impression that the patch as a whole isn''t following this abstract interface goal, so it should probably be consistent here and not try to do so in a just single place.>... >+static inline int tmem_new_pool(u64 uuid_lo, u64 uuid_hi, u32 flags) >+{ >+ flags |= (PAGE_SHIFT - 12) << TMEM_POOL_PAGESIZE_SHIFT;This hard-coded literal 12 seems dubious: Is this part of the tmem hypervisor interface? If so, there should be a #define in the interface header, plus a (ideally build-time) check that the value really fits TMEM_POOL_PAGESIZE_MASK. Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Dan Magenheimer
2009-Jun-10 14:51 UTC
RE: [Xen-devel] [PATCH] [linux-2.6.18-xen] transcendent memory ("tmem")linux-side changes
> A few general remarks:Excellent! Thanks very much for the feedback! A few comments and explanations below. Your comments and feedback will be especially useful when proposing the patch upstream. After reading my replies, please let me know if there is anything you see as a *must-fix* for the patch to be put into the (soon-to-be-obsolescent) 2.6.18-xen tree, where its primary goal is to allow tmem to be more easily used by xen developers and leading-edge xen users.> >+ unsigned long *preswap_map; > >+ unsigned int preswap_pages; > > Missing #ifdef CONFIG_PRESWAP?I had one there in an earlier draft but it causes more ifdef''s elsewhere. Since swap_info_struct is so rarely instantiated (one per swap device), I thought maybe avoiding additional ifdef''s might be more attractive than saving a few bytes system-wide. Thoughts?> >+static inline void *preswap_malloc(unsigned long size) > >+{ > >+ return vmalloc(size); > > I would view this as problematic on 32-bit, as the vmalloc > space is rather > restricted there. Not because the allocation here may fail, > but because it > may cause failures elsewhere. Some feedback on the amount of vmalloc > space used and/or still available is going to be needed here > I''d think.The swapmap requires one byte vmalloc''ed for every page in every swap device. Preswap_map requires one additional bit to be vmalloc''ed for every swapmap byte. Is this likely to be a problem then? By "some feedback", do you mean I should add a comment?> >+extern struct swap_list_t swap_list; > > extern spinlock_t swap_lock; > > That''s not good - swap_list being non-static in 2.6.18 is > just an oversight > afaict, which is being fixed in later kernels. I think it > would be cleaner (also > wrt. a possible pv-ops implementation) to move some or all of > the code in > mm/preswap.c into mm/swapfile.c (would likewise eliminate the need to > remove the ''static'' from try_to_unuse()).I noticed in recent kernel versions that swap_list has seemed to bounce back and forth between static and non-static. It isn''t clear to me if this is for data-hiding or just because someone noticed it wasn''t being used in other source files. I thought about adding list-walking macros, but it seemed a bit extreme just to avoid extern-ifying a single static. In any case, unless you object, I''d like to leave this as is for 2.6.18-xen but I do expect some feedback and possible changes when proposing it upstream. (And for upstream, I''d rather start with a cleaner diffstat, then move chunks of code to swapfile.c in response to maintainer feedback :-)> >+config TMEM > >+ bool "Transcendent memory support" > >+ def_bool y > >+ depends on XEN > >+ help > >+ In a virtualized environment, allows unused and underutilized > >+ system physical memory to be made accessible through a narrow > >+ well-defined page-copy-based API. If unsure, say Y. > > ''bool'' followed by ''def_bool''? And even more, an experimental > (as I would > view it at this stage) feature that defaults to ''yes''?I have to admit complete bafflement at Kconfig syntax. It seems I can never get it right. (Is there a good README someplace that I can learn from?) It would certainly never default to ''yes'' upstream, but I thought it might be OK in 2.6.18-xen, especially since it does essentially nothing if Xen doesn''t have tmem (or if it has it but it is not enabled). If ''no'' is best, what''s the best Kconfig syntax for that? (I tried changing "def_bool y" to "def_bool n" once and it didn''t make any difference!)> Also, does have selecting TMEM on its own any meaning? If not (which I > think is the case), then ''select''ing TMEM from the other two > options may > be more appropriate (while TMEM then shouldn''t have a prompt anymore).I am anticipating other kernel uses for tmem, though what you suggest makes sense (if I could figure out the Kconfig syntax to do it! :-)> >+++ b/mm/tmem.c Tue Jun 09 15:37:03 2009 -0600 > This is Xen-only code sitting in an apparently generic source > ... > While I partly understand your intention of potentially > sharing the implementation > with other hypervisors, I don''t think Xen interface > definitions belong here - you''re > not using include/xen/interface/tmem.h at all, while that''s > where the definitions > should come from imo. If you really want this to represent an > abstract interface, > then ensuring consistency with the Xen interface would be a > minimal requirement > (by e.g. including the Xen header in the #ifdef CONFIG_XEN > block). Otoh I''m > getting the impression that the patch as a whole isn''t > following this abstract > interface goal, so it should probably be consistent here and > not try to do so in > a just single place.The concept of tmem is very generic and I foresee uses for it in the native kernel. I agree that the patch is kind of halfway between xen-specific and abstract. I hesitate to completely xen-ify it but nor can I fully abstract it yet unless/until there is another user for it.> >+static inline int tmem_new_pool(u64 uuid_lo, u64 uuid_hi, u32 flags) > >+{ > >+ flags |= (PAGE_SHIFT - 12) << TMEM_POOL_PAGESIZE_SHIFT; > > This hard-coded literal 12 seems dubious: Is this part of the > tmem hypervisor > interface? If so, there should be a #define in the interface > header, plus a > (ideally build-time) check that the value really fits > TMEM_POOL_PAGESIZE_MASK.Good point. To explain, the tmem API supports other pagesizes than 4K, but only pagesizes that are a multiple of 4K. I agree that I should have a mask (which is 4 bits, allowing pagesizes up to 128MB) and a define for "12" (TMEM_POOL_MIN_PAGESIZE_SHIFT). Thanks again, Jan! Dan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Jan Beulich
2009-Jun-10 15:21 UTC
RE: [Xen-devel] [PATCH] [linux-2.6.18-xen] transcendent memory("tmem")linux-side changes
>>> Dan Magenheimer <dan.magenheimer@oracle.com> 10.06.09 16:51 >>> >> >+ unsigned long *preswap_map; >> >+ unsigned int preswap_pages; >> >> Missing #ifdef CONFIG_PRESWAP? > >I had one there in an earlier draft but it causes more >ifdef''s elsewhere. Since swap_info_struct is so rarely >instantiated (one per swap device), I thought maybe >avoiding additional ifdef''s might be more attractive >than saving a few bytes system-wide. Thoughts?Of course in the end it''s up to you, but in our kernels we''re trying to keep the native kernel as unaffected as possible, which means that conditonals should be where-ever code is added without abstracting macros/inline functions which would expand to nothing in native.>> >+static inline void *preswap_malloc(unsigned long size) >> >+{ >> >+ return vmalloc(size); >> >> I would view this as problematic on 32-bit, as the vmalloc >> space is rather >> restricted there. Not because the allocation here may fail, >> but because it >> may cause failures elsewhere. Some feedback on the amount of vmalloc >> space used and/or still available is going to be needed here >> I''d think. > >The swapmap requires one byte vmalloc''ed for every page >in every swap device. Preswap_map requires one additional >bit to be vmalloc''ed for every swapmap byte. Is thisOkay, I thought I saw it to be a pointer per page. In any case, the question is how large a swap file is "reasonable" on 32-bits, and what consequences exceeding "reasonable" would have. In any case, the supported limit is 4G pages afair, and having a swap file anywhere in the tens or hundreds of Gb range could become a problem. But I agree that this may be more a theoretical issue, since such configurations should rather use 64-bit setups.>likely to be a problem then? By "some feedback", do >you mean I should add a comment?No, I really meant feedback from the vmalloc allocator.>I noticed in recent kernel versions that swap_list has >seemed to bounce back and forth between static and non-static. >It isn''t clear to me if this is for data-hiding or just because >someone noticed it wasn''t being used in other source files. > >I thought about adding list-walking macros, but it seemed >a bit extreme just to avoid extern-ifying a single static. > >In any case, unless you object, I''d like to leave this >as is for 2.6.18-xen but I do expect some feedback and >possible changes when proposing it upstream. (And for >upstream, I''d rather start with a cleaner diffstat, then >move chunks of code to swapfile.c in response to maintainer >feedback :-)As above, it''s your choice, but will make integrating this into our kernels mode difficult (until there''s a production ready pv-ops based kernel that we could leverage).>> >+config TMEM >> >+ bool "Transcendent memory support" >> >+ def_bool y >> >+ depends on XEN >> >+ help >> >+ In a virtualized environment, allows unused and underutilized >> >+ system physical memory to be made accessible through a narrow >> >+ well-defined page-copy-based API. If unsure, say Y. >> >> ''bool'' followed by ''def_bool''? And even more, an experimental >> (as I would >> view it at this stage) feature that defaults to ''yes''? > >I have to admit complete bafflement at Kconfig syntax. It >seems I can never get it right. (Is there a good README >someplace that I can learn from?)I''ve never seen one other than Linux'' Documentation/kbuld/kconfig-language.txt (which likely isn''t what you''re after).>It would certainly never default to ''yes'' upstream, but I >thought it might be OK in 2.6.18-xen, especially since it >does essentially nothing if Xen doesn''t have tmem (or if >it has it but it is not enabled). > >If ''no'' is best, what''s the best Kconfig syntax for that? >(I tried changing "def_bool y" to "def_bool n" once and >it didn''t make any difference!) > >> Also, does have selecting TMEM on its own any meaning? If not (which I >> think is the case), then ''select''ing TMEM from the other two >> options may >> be more appropriate (while TMEM then shouldn''t have a prompt anymore). > >I am anticipating other kernel uses for tmem, though what >you suggest makes sense (if I could figure out the Kconfig >syntax to do it! :-)Here a completely untested suggestion: # # support for transcendent memory # config TMEM bool config PRECACHE bool "Cache clean pages in transcendent memory" depends on XEN select TMEM help Allows the transcendent memory pool to be used to store clean page-cache pages which, under some circumstances, will greatly reduce paging and thus improve performance. If unsure, say Y. config PRESWAP bool "Swap pages to transcendent memory" depends on XEN select TMEM help Allows the transcendent memory pool to be used as a pseudo-swap device which, under some circumstances, will greatly reduce swapping and thus improve performance. If unsure, say Y. (whether you''d want to keep the help previously associated with TMEM is a matter of taste - an option without prompt has no way to ever display its help text, so it would just serve documentation purposes in the Kconfig file itself). Jan _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel