Dan Magenheimer
2009-Feb-13 22:26 UTC
[Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Keir (and xen physical memory management experts) --

Alright, I think I am ready for the final step of plugging tmem into the existing xen physical memory management code.** This is a bit long, but I'd appreciate some design feedback before I proceed, and that requires a bit of background explanation... if this isn't enough background, I'll be happy to answer any questions.

(Note that tmem is not intended to be deployed on a 32-bit hypervisor -- due to xenheap constraints -- and should port easily (though hasn't been ported yet) to ia64. It is currently controlled by a xen command-line option, default off, and it requires tmem-modified guests.)

Tmem absorbs essentially all free memory on the machine for its own use, but the vast majority of that memory can be easily freed, synchronously and on demand, for other uses. Tmem now maintains its own page list, tmem_page_list, which holds tmem pages when they (temporarily) don't contain data. (There's no sense scrubbing and freeing these to xenheap or domheap when tmem is just going to grab them again and overwrite them anyway.) So tmem holds three types of memory:

(1) machine pages (4K) on the tmem_page_list
(2) pages containing "ephemeral" data managed by tmem
(3) pages containing "persistent" data managed by tmem

Pages regularly move back and forth between ((2) or (3)) and (1) as part of tmem's normal operation. When a page is moved "involuntarily" from (2) to (1), we call this an "eviction". Note that, due to compression, evicting a tmem ephemeral data page does not necessarily free up a raw machine page (4K) of memory... partial pages are kept in a tmem-specific tlsf pool, and tlsf frees up the machine page when all allocations on it are freed. (tlsf is the mechanism underlying the new highly-efficient xmalloc added to xen-unstable late last year.)

Now let's assume that Xen needs memory but tmem has absorbed it all. Xen's demand is always one of the following (here, a page is a raw machine page (4K)):

A) a page
B) a large number of individual non-consecutive pages
C) a block of 2**N consecutive pages (order N > 0)

Of these, (A) eventually finds its way to alloc_heap_pages(); (B) happens in (at least) two circumstances: (i) when a new domain is created, and (ii) when a domain makes a balloon request; and (C) happens mostly at system startup and then rarely after that (when? why? see below).

Tmem will export this API:

a) struct page_info *tmem_relinquish_page(void)
b) struct page_info *tmem_relinquish_pageblock(int order)
c) uint32_t tmem_evict_npages(uint32_t npages)
d) uint32_t tmem_relinquish_pages(uint32_t npages)

(a) and (b) are internal to the hypervisor. (c) and (d) are internal and also accessible via privileged hypercall.

(a) is fairly straightforward and synchronous, though it may be a bit slow since it has to scrub the page before returning. If there is a page in tmem_page_list, it will (scrub and) return it. If not, it will evict tmem ephemeral data until a page is freed to tmem_page_list and then (scrub and) return that. If tmem has no more ephemeral pages to evict and there's nothing in tmem_page_list, it returns NULL. (a) can be used, for example, in alloc_heap_pages() when "No suitable memory blocks" can be found, so as to avoid failing the request.
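For concreteness, here is a minimal sketch of (a). The helpers tmem_page_list_remove() and tmem_evict_one_ephemeral_page() are illustrative stand-ins for tmem internals, and scrub_one_page() stands in for whatever page-scrubbing helper the heap code uses:

    /* Interface (a): hand back one scrubbed machine page, evicting
     * ephemeral data if necessary.  Helper names are illustrative only. */
    struct page_info *tmem_relinquish_page(void)
    {
        struct page_info *pg;

        while ( (pg = tmem_page_list_remove()) == NULL )
        {
            /* tmem_page_list is empty: evict ephemeral data until a whole
             * machine page falls free, or give up if nothing is left. */
            if ( !tmem_evict_one_ephemeral_page() )
                return NULL;
        }

        scrub_one_page(pg);   /* never expose another domain's stale data */
        return pg;
    }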
(b) is similar but is used when order > 0 (i.e. a bigger chunk of pages is needed). It works the same way except that, due to fragmentation, it may have to evict MANY pages, possibly ALL ephemeral data. Even then it still may not find enough consecutive pages to satisfy the request. Further, tmem doesn't use a buddy allocator... because it uses nothing larger than a machine page, it never needs one internally. So all of those pages need to be scrubbed and freed to the xen heap before it can even be determined whether the request can be satisfied. As a result, this is potentially VERY slow and still has a high probability of failure. Fortunately, requests for order>0 are, I think, rare.

(c) and (d) are intentionally not combined. (c) evicts tmem ephemeral pages until it has added at least npages (machine pages) to the tmem_page_list. This may be slow. For (d), I'm thinking it will transfer npages from tmem_page_list to the scrub_list, where the existing page_scrub_timer will eventually scrub them and free them to xen's heap. (c) will return the number of pages it successfully added to tmem_page_list, and (d) will return the number of pages it successfully moved from tmem_page_list to scrub_list.

So this leaves some design questions:

1) Does this design make sense?
2) Are there places other than alloc_heap_pages() in Xen where I need to add "hooks" for tmem to relinquish a page or a block of pages?
3) Are there any other circumstances I've forgotten where a large number of pages is requested?
4) Does anybody have a list of alloc requests of order > 0 that occur after xen startup (e.g. when launching a new domain), and the consequences of failing such a request? I'd consider not providing interface (b) at all if this never happens or if multi-page requests always fail gracefully (e.g. get broken into smaller-order requests). I'm thinking for now that I may not implement it, just fail it with a printk, and see if any bad things happen.

Thanks for taking the time to read through this... any feedback is appreciated.

Dan

** tmem has been working for months but the code has until now allocated (and freed) to (and from) xenheap and domheap. This has been a security hole as the pages were released unscrubbed and so data could easily leak between domains. Obviously this needed to be fixed :-) And scrubbing data at every transfer from tmem to domheap/xenheap would be a huge waste of CPU cycles, especially since the most likely next consumer of that same page is tmem again.
Keir Fraser
2009-Feb-14 07:41 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 13/02/2009 22:26, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 4) Does anybody have a list of alloc requests of
> order > 0

Domain and vcpu structs are order 1. Shadow pages are allocated in order-2 blocks.

> ** tmem has been working for months but the code has
> until now allocated (and freed) to (and from)
> xenheap and domheap. This has been a security hole
> as the pages were released unscrubbed and so data
> could easily leak between domains. Obviously this
> needed to be fixed :-) And scrubbing data at every
> transfer from tmem to domheap/xenheap would be a huge
> waste of CPU cycles, especially since the most likely
> next consumer of that same page is tmem again.

Then why not mark pages as coming from tmem when you free them, and scrub them on next use if it isn't going back to tmem?

I wasn't clear on who would call your C and D functions, and why they can't be merged. I might veto those depending on how ugly and exposed the changes are outside tmem.

 -- Keir
Dan Magenheimer
2009-Feb-14 15:58 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Thanks much for the reply!

> > 4) Does anybody have a list of alloc requests of
> > order > 0
>
> Domain and vcpu structs are order 1. Shadow pages are
> allocated in order-2 blocks.

Are all of these allocated at domain startup only? Or are any (shadow pages perhaps?) allocated at relatively random times? If random, what are the consequences if the allocation fails? Isn't it quite possible for a random order>0 allocation to fail today due to "natural causes", e.g. because the currently running domains by coincidence (or by ballooning) have used up all available memory? Have we just been "lucky" to date -- because fragmentation is so bad and ballooning is so rarely used -- that we haven't seen failures of order>0 allocations? (Or maybe we have seen them but didn't know it, because the observable symptoms are a failed domain creation or a failed migration?)

In other words, I'm wondering whether tmem really creates this problem or just increases the probability that it will happen. Perhaps Jan's idea of using xenheap as an "emergency fund" for free pages is really a good idea?

> > ** tmem has been working for months but the code has
> > until now allocated (and freed) to (and from)
> > xenheap and domheap. This has been a security hole
> > as the pages were released unscrubbed and so data
> > could easily leak between domains. Obviously this
> > needed to be fixed :-) And scrubbing data at every
> > transfer from tmem to domheap/xenheap would be a huge
> > waste of CPU cycles, especially since the most likely
> > next consumer of that same page is tmem again.
>
> Then why not mark pages as coming from tmem when you free
> them, and scrub them on next use if it isn't going back to tmem?

That's a reasonable idea... maybe with a "scrub_me" flag set in the struct page_info by tmem and checked by the existing alloc_heap_pages() (and ignored if a memflags flag set to "ignore_scrub_me" is passed to alloc_xxxheap_pages())? There'd also need to be a free_and_scrub_domheap_pages().

If you prefer that approach, I'll give it a go. But still, some (most?) of the time there will be no free pages, so alloc_heap_pages() will still need a hook to tmem for that case.

> I wasn't clear on who would call your C and D functions, and
> why they can't be merged. I might veto those depending on how
> ugly and exposed the changes are outside tmem.

I *think* these calls are made just from python code (domain creation and ballooning) and, if so, will just go through the existing tmem hypercall.

Thanks,
Dan
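P.S. A rough sketch of the "scrub_me" scheme above. PGC_scrub_me, MEMF_ignore_scrub_me, and the helper names are placeholders, scrub_one_page() stands in for the existing scrubbing helper, and the real allocator hooks are elided:

    /* Free side: tmem returns a page to the heap without scrubbing it, but
     * tags it so the stale data cannot leak to another domain later. */
    static void tmem_free_page_tagged(struct page_info *pg)
    {
        pg->count_info |= PGC_scrub_me;   /* page still holds tmem data */
        free_domheap_page(pg);            /* back to the heap, unscrubbed */
    }

    /* Alloc side: check applied by alloc_heap_pages() to each page it is
     * about to hand out.  MEMF_ignore_scrub_me lets tmem skip the scrub
     * when it immediately reclaims the page for its own use. */
    static void check_scrub_me(struct page_info *pg, unsigned int memflags)
    {
        if ( (pg->count_info & PGC_scrub_me) &&
             !(memflags & MEMF_ignore_scrub_me) )
        {
            scrub_one_page(pg);               /* wipe the stale tmem data */
            pg->count_info &= ~PGC_scrub_me;  /* page is now clean */
        }
    }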
Keir Fraser
2009-Feb-14 20:20 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 14/02/2009 15:58, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Are all of these allocated at domain startup only? Or
> are any (shadow pages perhaps?) allocated at relatively
> random times? If random, what are the consequences
> if the allocation fails? Isn't it quite possible
> for a random order>0 allocation to fail today due
> to "natural causes", e.g. because the currently running
> domains by coincidence (or by ballooning) have used
> up all available memory? Have we just been "lucky"
> to date -- because fragmentation is so bad and ballooning
> is so rarely used -- that we haven't seen failures
> of order>0 allocations? (Or maybe we have seen them but
> didn't know it, because the observable symptoms are
> a failed domain creation or a failed migration?)

I think the per-domain shadow pool is pre-reserved, so it should be okay. Lack of memory simply causes domain creation failure. Any extra memory that shadow code would try to allocate would just be gravy, I'm pretty sure.

> Perhaps Jan's idea of using xenheap as an "emergency
> fund" for free pages is really a good idea?

It's a can of worms. How big to make the pool? Who should be allowed to allocate from it, and when? What if the emergency pool becomes exhausted?

> That's a reasonable idea... maybe with a "scrub_me"
> flag set in the struct page_info by tmem and checked by the
> existing alloc_heap_pages() (and ignored if a memflags flag
> set to "ignore_scrub_me" is passed to alloc_xxxheap_pages())?
> There'd also need to be a free_and_scrub_domheap_pages().
>
> If you prefer that approach, I'll give it a go. But still,
> some (most?) of the time there will be no free pages, so
> alloc_heap_pages() will still need a hook to tmem
> for that case.

I'm not super fussed; it's just an idea to consider. Could doing it this new way make it less possible to scrub pages asynchronously before they're needed?

> I *think* these calls are made just from python code (domain creation
> and ballooning) and, if so, will just go through the existing
> tmem hypercall.

Well, probably okay.

 -- Keir
Dan Magenheimer
2009-Feb-19 22:20 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
OK, here are the changes I've implemented to plug tmem into the existing xen physical memory management code. Hopefully it looks OK.

For easier review, this patch and the diffstat below include only the files changed in the hypervisor for tmem.

I had some difficulty understanding the page_list macros, so I left page_list_splice unimplemented for now. The working code removes pages from the tmem list one at a time and adds them to the scrub list, but since the pages could number in the millions for a large-memory machine, this could be very slow.

Also, I'm uncertain about the change in alloc_heap_page... is any tlb flushing required given that tmem pages are never visible outside of the hypervisor?

Thanks,
Dan

 arch/x86/mm.c                  |   36 +++++++++++++++++++++++++++++++++++
 arch/x86/setup.c               |    3 ++
 arch/x86/x86_32/entry.S        |    2 +
 arch/x86/x86_64/compat/entry.S |    2 +
 arch/x86/x86_64/entry.S        |    2 +
 common/Makefile                |    4 +++
 common/compat/Makefile         |    1
 common/domain.c                |    4 +++
 common/page_alloc.c            |   42 +++++++++++++++++++++++++++++++++++------
 common/xmalloc_tlsf.c          |   33 ++++++++++++++++++++++----------
 include/Makefile               |    1
 include/asm-x86/mm.h           |    2 +
 include/public/xen.h           |    1
 include/xen/hypercall.h        |    5 ++++
 include/xen/mm.h               |   13 ++++++++++++
 include/xen/sched.h            |    3 ++
 include/xen/xmalloc.h          |    8 ++++++-
 include/xlat.lst               |    3 ++
 18 files changed, 148 insertions(+), 17 deletions(-)
Dan Magenheimer
2009-Feb-19 22:24 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Oops, always after the email is sent! :-)

I neglected the page_scrub_lock around scrub_list_add and scrub_list_splice. Consider those already added.

Dan

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Thursday, February 19, 2009 3:21 PM
> To: Keir Fraser; Xen-Devel (E-mail)
> Subject: RE: [Xen-devel] [RFC] design/API for plugging tmem into
> existing xen physical memory management code
>
> OK, here are the changes I've implemented to plug tmem into the
> existing xen physical memory management code. Hopefully it looks OK.
>
> For easier review, this patch and the diffstat below include only
> the files changed in the hypervisor for tmem.
>
> I had some difficulty understanding the page_list macros, so I left
> page_list_splice unimplemented for now. The working code removes pages
> from the tmem list one at a time and adds them to the scrub list,
> but since the pages could number in the millions for a large-memory
> machine, this could be very slow.
>
> Also, I'm uncertain about the change in alloc_heap_page... is
> any tlb flushing required given that tmem pages are never visible
> outside of the hypervisor?
>
> Thanks,
> Dan
>
>  arch/x86/mm.c                  |   36 +++++++++++++++++++++++++++++++++++
>  arch/x86/setup.c               |    3 ++
>  arch/x86/x86_32/entry.S        |    2 +
>  arch/x86/x86_64/compat/entry.S |    2 +
>  arch/x86/x86_64/entry.S        |    2 +
>  common/Makefile                |    4 +++
>  common/compat/Makefile         |    1
>  common/domain.c                |    4 +++
>  common/page_alloc.c            |   42 +++++++++++++++++++++++++++++++++++------
>  common/xmalloc_tlsf.c          |   33 ++++++++++++++++++++++----------
>  include/Makefile               |    1
>  include/asm-x86/mm.h           |    2 +
>  include/public/xen.h           |    1
>  include/xen/hypercall.h        |    5 ++++
>  include/xen/mm.h               |   13 ++++++++++++
>  include/xen/sched.h            |    3 ++
>  include/xen/xmalloc.h          |    8 ++++++-
>  include/xlat.lst               |    3 ++
>  18 files changed, 148 insertions(+), 17 deletions(-)
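For concreteness, a minimal sketch of interface (d) with the lock correction folded in -- scrub_list_add() and page_scrub_lock are the names mentioned above, while tmem_page_list_remove() is an illustrative stand-in for the tmem list helper:

    /* Move up to npages pages, one at a time, from tmem_page_list to the
     * scrub list, where page_scrub_timer will eventually scrub and free
     * them.  Helper names are placeholders for the actual tmem code. */
    static uint32_t tmem_relinquish_pages_sketch(uint32_t npages)
    {
        struct page_info *pg;
        uint32_t done = 0;

        spin_lock(&page_scrub_lock);
        while ( (done < npages) && ((pg = tmem_page_list_remove()) != NULL) )
        {
            scrub_list_add(pg);    /* queue for the asynchronous scrubber */
            done++;
        }
        spin_unlock(&page_scrub_lock);

        return done;  /* number of pages actually handed to the scrubber */
    }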
Keir Fraser
2009-Feb-20 08:29 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 19/02/2009 22:20, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Also, I'm uncertain about the change in alloc_heap_page... is
> any tlb flushing required given that tmem pages are never visible
> outside of the hypervisor?

No, the TLB flushes are always to flush guest mappings.

 -- Keir