Dan Magenheimer
2009-Feb-13 22:26 UTC
[Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Keir (and xen physical memory management experts) --

Alright, I think I am ready for the final step of plugging tmem into the existing xen physical memory management code.** This is a bit long, but I'd appreciate some design feedback before I proceed, and that requires a bit of background explanation... if this isn't enough background, I'll be happy to answer any questions.

(Note that tmem is not intended to be deployed on a 32-bit hypervisor -- due to xenheap constraints -- and should port easily (though hasn't been ported yet) to ia64. It is currently controlled by a xen command-line option, default off, and it requires tmem-modified guests.)

Tmem absorbs essentially all free memory on the machine for its own use, but the vast majority of that memory can be easily freed, synchronously and on demand, for other uses. Tmem now maintains its own page list, tmem_page_list, which holds tmem pages when they (temporarily) don't contain data. (There's no sense scrubbing and freeing these to xenheap or domheap when tmem is just going to grab them again and overwrite them anyway.) So tmem holds three types of memory:

(1) machine pages (4K) on the tmem_page_list
(2) pages containing "ephemeral" data managed by tmem
(3) pages containing "persistent" data managed by tmem

Pages regularly move back and forth between ((2) or (3)) and (1) as part of tmem's normal operation. When a page is moved "involuntarily" from (2) to (1), we call this an "eviction". Note that, due to compression, evicting a tmem ephemeral data page does not necessarily free up a raw machine page (4K) of memory... partial pages are kept in a tmem-specific tlsf pool, and tlsf frees up the machine page when all allocations on it are freed. (tlsf is the mechanism underlying the new highly-efficient xmalloc added to xen-unstable late last year.)

Now let's assume that Xen needs memory but tmem has absorbed it all. Xen's demand is always one of the following (here, a page is a raw machine page (4K)):

A) a page
B) a large number of individual non-consecutive pages
C) a block of 2**N consecutive pages (order N > 0)

Of these, (A) eventually finds its way to alloc_heap_pages(); (B) happens in (at least) two circumstances: (i) when a new domain is created, and (ii) when a domain makes a balloon request; and (C) happens mostly at system startup and then rarely after that (when? why? see below).

Tmem will export this API:

a) struct page_info *tmem_relinquish_page(void)
b) struct page_info *tmem_relinquish_pageblock(int order)
c) uint32_t tmem_evict_npages(uint32_t npages)
d) uint32_t tmem_relinquish_pages(uint32_t npages)

(a) and (b) are internal to the hypervisor. (c) and (d) are internal and also accessible via privileged hypercall.

(a) is fairly straightforward and synchronous, though it may be a bit slow since it has to scrub the page before returning. If there is a page in tmem_page_list, it will (scrub and) return it. If not, it will evict tmem ephemeral data until a page is freed to tmem_page_list and then (scrub and) return that. If tmem has no more ephemeral pages to evict and there's nothing in tmem_page_list, it returns NULL. (a) can be used, for example, in alloc_heap_pages() when "No suitable memory blocks" can be found, so as to avoid failing the request.
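For concreteness, here is a minimal sketch of (a). The helpers tmem_page_list_remove() and tmem_evict_one_ephemeral_page() are illustrative stand-ins for tmem internals, and scrub_one_page() stands in for whatever page-scrubbing helper the heap code uses:

    /* Interface (a): hand back one scrubbed machine page, evicting
     * ephemeral data if necessary.  Helper names are illustrative only. */
    struct page_info *tmem_relinquish_page(void)
    {
        struct page_info *pg;

        while ( (pg = tmem_page_list_remove()) == NULL )
        {
            /* tmem_page_list is empty: evict ephemeral data until a whole
             * machine page falls free, or give up if nothing is left. */
            if ( !tmem_evict_one_ephemeral_page() )
                return NULL;
        }

        scrub_one_page(pg);   /* never expose another domain's stale data */
        return pg;
    }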
(b) is similar but is used when order > 0 (i.e. a bigger chunk of pages is needed). It works the same way except that, due to fragmentation, it may have to evict MANY pages, possibly ALL ephemeral data. Even then it still may not find enough consecutive pages to satisfy the request. Further, tmem doesn't use a buddy allocator... because it uses nothing larger than a machine page, it never needs one internally. So all of those pages need to be scrubbed and freed to the xen heap before it can even be determined whether the request can be satisfied. As a result, this is potentially VERY slow and still has a high probability of failure. Fortunately, requests for order>0 are, I think, rare.

(c) and (d) are intentionally not combined. (c) evicts tmem ephemeral pages until it has added at least npages (machine pages) to the tmem_page_list. This may be slow. For (d), I'm thinking it will transfer npages from tmem_page_list to the scrub_list, where the existing page_scrub_timer will eventually scrub them and free them to xen's heap. (c) will return the number of pages it successfully added to tmem_page_list, and (d) will return the number of pages it successfully moved from tmem_page_list to scrub_list.

So this leaves some design questions:

1) Does this design make sense?
2) Are there places other than alloc_heap_pages() in Xen where I need to add "hooks" for tmem to relinquish a page or a block of pages?
3) Are there any other circumstances I've forgotten where a large number of pages is requested?
4) Does anybody have a list of alloc requests of order > 0 that occur after xen startup (e.g. when launching a new domain), and the consequences of failing such a request? I'd consider not providing interface (b) at all if this never happens or if multi-page requests always fail gracefully (e.g. get broken into smaller-order requests). I'm thinking for now that I may not implement it, just fail it with a printk, and see if any bad things happen.

Thanks for taking the time to read through this... any feedback is appreciated.

Dan

** tmem has been working for months but the code has until now allocated (and freed) to (and from) xenheap and domheap. This has been a security hole as the pages were released unscrubbed and so data could easily leak between domains. Obviously this needed to be fixed :-) And scrubbing data at every transfer from tmem to domheap/xenheap would be a huge waste of CPU cycles, especially since the most likely next consumer of that same page is tmem again.
Keir Fraser
2009-Feb-14 07:41 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 13/02/2009 22:26, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> 4) Does anybody have a list of alloc requests of
> order > 0

Domain and vcpu structs are order 1. Shadow pages are allocated in order-2 blocks.

> ** tmem has been working for months but the code has
> until now allocated (and freed) to (and from)
> xenheap and domheap. This has been a security hole
> as the pages were released unscrubbed and so data
> could easily leak between domains. Obviously this
> needed to be fixed :-) And scrubbing data at every
> transfer from tmem to domheap/xenheap would be a huge
> waste of CPU cycles, especially since the most likely
> next consumer of that same page is tmem again.

Then why not mark pages as coming from tmem when you free them, and scrub them on next use if it isn't going back to tmem?

I wasn't clear on who would call your C and D functions, and why they can't be merged. I might veto those depending on how ugly and exposed the changes are outside tmem.

 -- Keir
Dan Magenheimer
2009-Feb-14 15:58 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Thanks much for the reply!

> > 4) Does anybody have a list of alloc requests of
> > order > 0
>
> Domain and vcpu structs are order 1. Shadow pages are
> allocated in order-2 blocks.

Are all of these allocated at domain startup only? Or are any (shadow pages perhaps?) allocated at relatively random times? If random, what are the consequences if the allocation fails? Isn't it quite possible for a random order>0 allocation to fail today due to "natural causes", e.g. because the currently running domains by coincidence (or by ballooning) have used up all available memory? Have we just been "lucky" to date -- because fragmentation is so bad and ballooning is so rarely used -- that we haven't seen failures of order>0 allocations? (Or maybe we have seen them but didn't know it, because the observable symptoms are a failed domain creation or a failed migration?)

In other words, I'm wondering whether tmem really creates this problem or just increases the probability that it will happen. Perhaps Jan's idea of using xenheap as an "emergency fund" for free pages is really a good idea?

> > ** tmem has been working for months but the code has
> > until now allocated (and freed) to (and from)
> > xenheap and domheap. This has been a security hole
> > as the pages were released unscrubbed and so data
> > could easily leak between domains. Obviously this
> > needed to be fixed :-) And scrubbing data at every
> > transfer from tmem to domheap/xenheap would be a huge
> > waste of CPU cycles, especially since the most likely
> > next consumer of that same page is tmem again.
>
> Then why not mark pages as coming from tmem when you free
> them, and scrub them on next use if it isn't going back to tmem?

That's a reasonable idea... maybe with a "scrub_me" flag set in the struct page_info by tmem and checked by the existing alloc_heap_pages() (and ignored if a memflags flag set to "ignore_scrub_me" is passed to alloc_xxxheap_pages())? There'd also need to be a free_and_scrub_domheap_pages().

If you prefer that approach, I'll give it a go. But still, some (most?) of the time there will be no free pages, so alloc_heap_pages() will still need a hook to tmem for that case.

> I wasn't clear on who would call your C and D functions, and
> why they can't be merged. I might veto those depending on how
> ugly and exposed the changes are outside tmem.

I *think* these calls are made just from python code (domain creation and ballooning) and, if so, will just go through the existing tmem hypercall.

Thanks,
Dan
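P.S. A rough sketch of the "scrub_me" scheme above. PGC_scrub_me, MEMF_ignore_scrub_me, and the helper names are placeholders, scrub_one_page() stands in for the existing scrubbing helper, and the real allocator hooks are elided:

    /* Free side: tmem returns a page to the heap without scrubbing it, but
     * tags it so the stale data cannot leak to another domain later. */
    static void tmem_free_page_tagged(struct page_info *pg)
    {
        pg->count_info |= PGC_scrub_me;   /* page still holds tmem data */
        free_domheap_page(pg);            /* back to the heap, unscrubbed */
    }

    /* Alloc side: check applied by alloc_heap_pages() to each page it is
     * about to hand out.  MEMF_ignore_scrub_me lets tmem skip the scrub
     * when it immediately reclaims the page for its own use. */
    static void check_scrub_me(struct page_info *pg, unsigned int memflags)
    {
        if ( (pg->count_info & PGC_scrub_me) &&
             !(memflags & MEMF_ignore_scrub_me) )
        {
            scrub_one_page(pg);               /* wipe the stale tmem data */
            pg->count_info &= ~PGC_scrub_me;  /* page is now clean */
        }
    }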
Keir Fraser
2009-Feb-14 20:20 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 14/02/2009 15:58, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Are all of these allocated at domain startup only? Or
> are any (shadow pages perhaps?) allocated at relatively
> random times? If random, what are the consequences
> if the allocation fails? Isn't it quite possible
> for a random order>0 allocation to fail today due
> to "natural causes", e.g. because the currently running
> domains by coincidence (or by ballooning) have used
> up all available memory? Have we just been "lucky"
> to date -- because fragmentation is so bad and ballooning
> is so rarely used -- that we haven't seen failures
> of order>0 allocations? (Or maybe we have seen them but
> didn't know it, because the observable symptoms are
> a failed domain creation or a failed migration?)

I think the per-domain shadow pool is pre-reserved, so it should be okay. Lack of memory simply causes domain creation failure. Any extra memory that shadow code would try to allocate would just be gravy, I'm pretty sure.

> Perhaps Jan's idea of using xenheap as an "emergency
> fund" for free pages is really a good idea?

It's a can of worms. How big to make the pool? Who should be allowed to allocate from it, and when? What if the emergency pool becomes exhausted?

> That's a reasonable idea... maybe with a "scrub_me"
> flag set in the struct page_info by tmem and checked by the
> existing alloc_heap_pages() (and ignored if a memflags flag
> set to "ignore_scrub_me" is passed to alloc_xxxheap_pages())?
> There'd also need to be a free_and_scrub_domheap_pages().
>
> If you prefer that approach, I'll give it a go. But still,
> some (most?) of the time there will be no free pages, so
> alloc_heap_pages() will still need a hook to tmem
> for that case.

I'm not super fussed; it's just an idea to consider. Could doing it this new way make it less possible to scrub pages asynchronously before they're needed?

> I *think* these calls are made just from python code (domain creation
> and ballooning) and, if so, will just go through the existing
> tmem hypercall.

Well, probably okay.

 -- Keir
Dan Magenheimer
2009-Feb-19 22:20 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
OK, here are the changes I've implemented to plug tmem into the existing xen physical memory management code. Hopefully it looks OK.

For easier review, this patch and the diffstat below include only the files changed in the hypervisor for tmem.

I had some difficulty understanding the page_list macros, so I left page_list_splice unimplemented for now. The working code removes pages from the tmem list one at a time and adds them to the scrub list, but since the pages could number in the millions for a large-memory machine, this could be very slow.

Also, I'm uncertain about the change in alloc_heap_page... is any tlb flushing required given that tmem pages are never visible outside of the hypervisor?

Thanks,
Dan

 arch/x86/mm.c                  |   36 +++++++++++++++++++++++++++++++++++
 arch/x86/setup.c               |    3 ++
 arch/x86/x86_32/entry.S        |    2 +
 arch/x86/x86_64/compat/entry.S |    2 +
 arch/x86/x86_64/entry.S        |    2 +
 common/Makefile                |    4 +++
 common/compat/Makefile         |    1
 common/domain.c                |    4 +++
 common/page_alloc.c            |   42 +++++++++++++++++++++++++++++++++++------
 common/xmalloc_tlsf.c          |   33 ++++++++++++++++++++++----------
 include/Makefile               |    1
 include/asm-x86/mm.h           |    2 +
 include/public/xen.h           |    1
 include/xen/hypercall.h        |    5 ++++
 include/xen/mm.h               |   13 ++++++++++++
 include/xen/sched.h            |    3 ++
 include/xen/xmalloc.h          |    8 ++++++-
 include/xlat.lst               |    3 ++
 18 files changed, 148 insertions(+), 17 deletions(-)
Dan Magenheimer
2009-Feb-19 22:24 UTC
RE: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
Oops, always after the email is sent! :-)

I neglected the page_scrub_lock around scrub_list_add and scrub_list_splice. Consider those already added.

Dan

> -----Original Message-----
> From: Dan Magenheimer
> Sent: Thursday, February 19, 2009 3:21 PM
> To: Keir Fraser; Xen-Devel (E-mail)
> Subject: RE: [Xen-devel] [RFC] design/API for plugging tmem into
> existing xen physical memory management code
>
> OK, here are the changes I've implemented to plug tmem into the
> existing xen physical memory management code. Hopefully it looks OK.
>
> For easier review, this patch and the diffstat below include only
> the files changed in the hypervisor for tmem.
>
> I had some difficulty understanding the page_list macros, so I left
> page_list_splice unimplemented for now. The working code removes pages
> from the tmem list one at a time and adds them to the scrub list,
> but since the pages could number in the millions for a large-memory
> machine, this could be very slow.
>
> Also, I'm uncertain about the change in alloc_heap_page... is
> any tlb flushing required given that tmem pages are never visible
> outside of the hypervisor?
>
> Thanks,
> Dan
>
>  arch/x86/mm.c                  |   36 +++++++++++++++++++++++++++++++++++
>  arch/x86/setup.c               |    3 ++
>  arch/x86/x86_32/entry.S        |    2 +
>  arch/x86/x86_64/compat/entry.S |    2 +
>  arch/x86/x86_64/entry.S        |    2 +
>  common/Makefile                |    4 +++
>  common/compat/Makefile         |    1
>  common/domain.c                |    4 +++
>  common/page_alloc.c            |   42 +++++++++++++++++++++++++++++++++++------
>  common/xmalloc_tlsf.c          |   33 ++++++++++++++++++++++----------
>  include/Makefile               |    1
>  include/asm-x86/mm.h           |    2 +
>  include/public/xen.h           |    1
>  include/xen/hypercall.h        |    5 ++++
>  include/xen/mm.h               |   13 ++++++++++++
>  include/xen/sched.h            |    3 ++
>  include/xen/xmalloc.h          |    8 ++++++-
>  include/xlat.lst               |    3 ++
>  18 files changed, 148 insertions(+), 17 deletions(-)
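For concreteness, a minimal sketch of interface (d) with the lock correction folded in -- scrub_list_add() and page_scrub_lock are the names mentioned above, while tmem_page_list_remove() is an illustrative stand-in for the tmem list helper:

    /* Move up to npages pages, one at a time, from tmem_page_list to the
     * scrub list, where page_scrub_timer will eventually scrub and free
     * them.  Helper names are placeholders for the actual tmem code. */
    static uint32_t tmem_relinquish_pages_sketch(uint32_t npages)
    {
        struct page_info *pg;
        uint32_t done = 0;

        spin_lock(&page_scrub_lock);
        while ( (done < npages) && ((pg = tmem_page_list_remove()) != NULL) )
        {
            scrub_list_add(pg);    /* queue for the asynchronous scrubber */
            done++;
        }
        spin_unlock(&page_scrub_lock);

        return done;  /* number of pages actually handed to the scrubber */
    }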
Keir Fraser
2009-Feb-20 08:29 UTC
Re: [Xen-devel] [RFC] design/API for plugging tmem into existing xen physical memory management code
On 19/02/2009 22:20, "Dan Magenheimer" <dan.magenheimer@oracle.com> wrote:

> Also, I'm uncertain about the change in alloc_heap_page... is
> any tlb flushing required given that tmem pages are never visible
> outside of the hypervisor?

No, the TLB flushes are always to flush guest mappings.

 -- Keir