Dan Magenheimer
2012-Sep-26 21:17 UTC
domain creation vs querying free memory (xend and xl)
I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. I'll look into it further, as well as for xl, but I thought I'd ask first to see if there is a known answer or if this is a known problem:

Suppose that xm/xl create is issued on a large-memory domain (PV or HVM or, future, PVH). It takes awhile for this domain to launch and during at least part of this time, the toolset hasn't yet requested all of the required memory from the hypervisor to complete the launch of the domain... or perhaps the toolset has, but the hypervisor is slow about calling the long sequence of page allocations (e.g. maybe because it is zeroing each page?).

Then it is desired to launch a second large-memory domain. The tools can query Xen to see if there is sufficient RAM and there is, because the first launch has not yet allocated all the RAM assigned to it.

But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.

Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?

Thanks for any comments or discussion!

Dan
Konrad Rzeszutek Wilk
2012-Sep-27 11:26 UTC
Re: domain creation vs querying free memory (xend and xl)
On Wed, Sep 26, 2012 at 02:17:06PM -0700, Dan Magenheimer wrote:
> I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. I'll look into it further, as well as for xl, but I thought I'd ask first to see if there is a known answer or if this is a known problem:
>
> Suppose that xm/xl create is issued on a large-memory domain (PV or HVM or, future, PVH). It takes awhile for this domain to launch and during at least part of this time, the toolset hasn't yet requested all of the required memory from the hypervisor to complete the launch of the domain... or perhaps the toolset has, but the hypervisor is slow about calling the long sequence of page allocations (e.g. maybe because it is zeroing each page?).
>
> Then it is desired to launch a second large-memory domain. The tools can query Xen to see if there is sufficient RAM and there is, because the first launch has not yet allocated all the RAM assigned to it.
>
> But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.
>
> Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?

So, say, "freeze" the amount of free memory. Let's CC the XCP folks.

> Thanks for any comments or discussion!
>
> Dan
George Shuklin
2012-Sep-27 15:24 UTC
Re: domain creation vs querying free memory (xend and xl)
Not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

On 27.09.2012 01:17, Dan Magenheimer wrote:
> I was asked a question that seems like it should be obvious but it doesn't seem to be, at least in xm-land. [...]
Dan Magenheimer
2012-Sep-27 15:32 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Konrad Rzeszutek Wilk
> Subject: Re: domain creation vs querying free memory (xend and xl)
>
> On Wed, Sep 26, 2012 at 02:17:06PM -0700, Dan Magenheimer wrote:
> > [...]
> > Does this make sense? Should the tools "reserve" maxmem as a "transaction" and/or ensure that "xm/xl free" calls account for the entire requested amount of RAM? Or maybe xl _does_ work this way?
>
> So, say, "freeze" the amount of free memory. Let's CC the XCP folks.

Hmmm... the problem is the opposite (I think, since I don't have hardware at hand to reproduce it).

Assume a machine has 2TB of physical RAM and a "xm create" is started to launch a 1TB guest called "X". While X is being launched, another thread watches "xm free" and sees that it slowly goes down from 1.995TB. That thread does not know what the eventual "floor" will be. Now a third thread does a "xm create" to launch a second 1TB guest "Y". The "xm create" asks the hypervisor and sees, yep, there is, at this moment, 1.376TB of free memory, so it commences launching the guest. Because the hypervisor and dom0 consume some RAM, both of these "xm create" operations will eventually fail, possibly after several minutes.

Seems like a "xm unreserved" is needed, similar to "xm free" but taking into account the tools' knowledge of what RAM is in the process of being reserved for launching domains, not just the allocation requests the hypervisor has already processed.
Dario Faggioli
2012-Sep-28 16:08 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-09-27 at 19:24 +0400, George Shuklin wrote:
> not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

IIRC, xl has a very coarse-grained locking mechanism in place for domain creation too. As a result of that, you shouldn't be able to create two domains at the same time, which should be enough to prevent the situation described in the original e-mail from occurring.

Looking at acquire_lock() and release_lock() (and at where they are called) in the xl code should clarify whether or not that is enough to actually avoid the race (which I think it is, but I might be wrong :-D).

That being said, there is still room for races, although not wrt domain creation, as, for instance, there isn't any synchronization between creation and ballooning, which both manipulate memory. So maybe thinking about some kind of reservation-based/transactional mechanism at some level might make actual sense.

Unfortunately, I've no idea about how xm works in that respect.

Hope this at least helps clarify the situation a bit. :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
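[For illustration only -- this is not xl's actual implementation; the lock-file path and function names are invented -- the kind of coarse-grained serialization Dario describes amounts to every domain-creation invocation holding an advisory lock for the duration of the build:]

```c
/* Sketch of coarse-grained creation serialization via an advisory file lock.
 * Illustrative only: path and names are invented, error handling is minimal. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

static int acquire_creation_lock(void)
{
    int fd = open("/var/lock/demo-domain-create.lock", O_CREAT | O_RDWR, 0644);
    if (fd < 0) {
        perror("open lockfile");
        exit(1);
    }
    if (flock(fd, LOCK_EX) < 0) {   /* blocks until any other creator finishes */
        perror("flock");
        exit(1);
    }
    return fd;
}

static void release_creation_lock(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

int main(void)
{
    int fd = acquire_creation_lock();
    /* ... check free memory and build the domain here; no other invocation
     * serialized on this lock can allocate domain memory concurrently ... */
    puts("domain built under the creation lock");
    release_creation_lock(fd);
    return 0;
}
```

[The obvious limitation, discussed later in the thread, is that such a lock serializes only cooperating toolstack invocations; it does nothing about ballooning, paging, sharing, or tmem changing allocations underneath it.]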
Ian Jackson
2012-Sep-28 17:12 UTC
Re: domain creation vs querying free memory (xend and xl)
Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"):
> But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.

This is a real problem. The solution is not easy, and may not make it for 4.3. It would involve a rework of the memory handling code in libxl.

Ian.
Dan Magenheimer
2012-Oct-01 20:03 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com]
> Sent: Friday, September 28, 2012 11:12 AM
> To: Dan Magenheimer
> Cc: xen-devel@lists.xen.org; Kurt Hackel; Konrad Wilk
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"):
> > But the second domain launch fails, possibly after several minutes because, actually, there isn't enough physical RAM for both.
>
> This is a real problem. The solution is not easy, and may not make it for 4.3. It would involve a rework of the memory handling code in libxl.

[broadening cc to "Xen memory technology people", please forward/add if I missed someone]

Hi Ian --

If you can estimate the difficulty, it would appear you have a specific libxl design in mind? Maybe it would be useful to brainstorm a bit to see if there might be a simpler/different solution?

Bearing in mind that I know almost nothing about xl or the tools layer, and that, as a result, I tend to look for hypervisor solutions, I'm thinking it's not possible to solve this without direct participation of the hypervisor anyway, at least while ensuring the solution will successfully work with any memory technology that involves ballooning with the possibility of overcommit (i.e. tmem, page sharing and host-swapping, manual ballooning, PoD)... EVEN if the toolset is single threaded (i.e. only one domain may be created at a time, such as xapi). [1]

As a result, I've cc'ed other parties involved in memory technologies who can chime in if they think the above statement is incorrect for their technology...

Back to design brainstorming:

The way I am thinking about it, the tools need to be involved to the extent that they would need to communicate to the hypervisor the following facts (probably via a new hypercall):

X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)

X2) The launch of X is complete and I will not be requesting the allocation of any more RAM for it. Please release the reservation, whether or not I've requested a total of N MB.

The calls may be nested or partially ordered, i.e.

X1...Y1...Y2...X2
X1...Y1...X2...Y2

and the hypervisor must be able to deal with this.

Then there would need to be two "versions" of "xm/xl free". We can quibble about which should be the default, but they would be:

- "xl --reserved free" asks the hypervisor how much RAM is available taking into account reservations
- "xl --raw free" asks the hypervisor for the instantaneous amount of RAM unallocated, not counting reservations

When the tools are not launching a domain (that is, there has been a matching X2 for every X1), the results of the above "free" queries are always identical.

So, IanJ, does this match up with the design you were thinking about?

Thanks,
Dan

[1] I think the core culprits are (a) the hypervisor accounts for memory allocation of pages strictly on a first-come-first-served basis and (b) the tools don't have any form of need-this-much-memory "transaction" model
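[To make the transactional semantics concrete, here is a minimal sketch -- purely illustrative, not existing Xen hypercalls or data structures; all names and numbers are invented, and real code would need locking and per-domain tracking -- of the accounting the hypervisor could keep for X1/X2 and the two flavours of "free":]

```c
/* Illustrative sketch only: invented names, not Xen code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t free_mb = 1900000;   /* ~2TB machine minus hypervisor/dom0 overhead */
static uint64_t reserved_mb;         /* promised to in-flight domain builds          */

/* X1: atomically check for enough *unreserved* memory and reserve it. */
static bool reserve_launch(uint64_t need_mb, uint64_t *resv)
{
    if (free_mb - reserved_mb < need_mb)
        return false;                /* fail up front, not minutes later */
    reserved_mb += need_mb;
    *resv = need_mb;
    return true;
}

/* Allocation during the build draws down free memory and the reservation.
 * (The reservation guarantees the memory is there, so no check is needed here.) */
static void allocate(uint64_t mb, uint64_t *resv)
{
    free_mb -= mb;
    uint64_t used = mb < *resv ? mb : *resv;
    *resv -= used;
    reserved_mb -= used;
}

/* X2: launch finished; drop whatever part of the reservation went unused. */
static void release_launch(uint64_t *resv)
{
    reserved_mb -= *resv;
    *resv = 0;
}

int main(void)
{
    uint64_t x = 0, y = 0;
    reserve_launch(1000000, &x);                 /* X1 for a 1TB guest       */
    printf("raw free %llu MB, unreserved free %llu MB\n",
           (unsigned long long)free_mb,
           (unsigned long long)(free_mb - reserved_mb));
    if (!reserve_launch(1000000, &y))            /* Y1 now fails immediately */
        puts("second 1TB launch refused: not enough unreserved memory");
    allocate(1000000, &x);                       /* slow build of X proceeds */
    release_launch(&x);                          /* X2                       */
    return 0;
}
```

[The "xl --reserved free" query in the proposal corresponds to free_mb - reserved_mb here, and "xl --raw free" to free_mb alone.]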
Tim Deegan
2012-Oct-02 09:10 UTC
Re: domain creation vs querying free memory (xend and xl)
At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> Bearing in mind that I know almost nothing about xl or the tools layer, and that, as a result, I tend to look for hypervisor solutions, I'm thinking it's not possible to solve this without direct participation of the hypervisor anyway, at least while ensuring the solution will successfully work with any memory technology that involves ballooning with the possibility of overcommit (i.e. tmem, page sharing and host-swapping, manual ballooning, PoD)... EVEN if the toolset is single threaded (i.e. only one domain may be created at a time, such as xapi). [1]

TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.

Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much. In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM. Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

Cheers,

Tim.
Ian Campbell
2012-Oct-02 09:47 UTC
Re: domain creation vs querying free memory (xend and xl)
On Tue, 2012-10-02 at 10:10 +0100, Tim Deegan wrote:
> At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> > [...]
>
> TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.
>
> Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much. In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM. Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

xl today has a big lock around domain creation, which solves the original issue that Dan describes but still has the issue that you describe.

IIRC Dario was going to be looking at adding something to (one or more of) xen, libxl and xl to allow this to be handled more cleverly as part of the NUMA work in 4.3. I think that the intention was still that there would be a critical section within all of the colluding xl instances where memory was set aside for a particular domain, possibly with hypervisor assistance.

Ian.
Dan Magenheimer
2012-Oct-02 18:17 UTC
Re: domain creation vs querying free memory (xend and xl)
(Rats, thought I sent this out yesterday...)

> From: Dario Faggioli [mailto:raistlin@linux.it]
> Sent: Friday, September 28, 2012 10:08 AM
> To: George Shuklin
> Cc: xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> On Thu, 2012-09-27 at 19:24 +0400, George Shuklin wrote:
> > not sure about xl/xm, but xapi performs one start at a time, so there is no race between domains for memory or other resources.

Oops, sorry, I missed this part of the thread because I wasn't directly cc'ed and am behind on my xen-devel reading...

> IIRC, xl has a very coarse-grained locking mechanism in place for domain creation too. As a result of that, you shouldn't be able to create two domains at the same time, which should be enough to prevent the situation described in the original e-mail from occurring.
>
> Looking at acquire_lock() and release_lock() (and at where they are called) in the xl code should clarify whether or not that is enough to actually avoid the race (which I think it is, but I might be wrong :-D).

This sounds like a pretty serious limitation, especially if it applies to migration as well as creation (or a combination)... I hope it is not a regression from xm to xl. For example, suppose a data center is trying to do a planned downtime for machine X by force-migrating all guests to machine Y. It sounds like xl would serialize this?

> That being said, there is still room for races, although not wrt domain creation, as, for instance, there isn't any synchronization between creation and ballooning, which both manipulate memory. So maybe thinking about some kind of reservation-based/transactional mechanism at some level might make actual sense.

Which is mostly the reason I am interested ;-) though solving the superset of my problem is probably a good thing as well.

Dan
Dan Magenheimer
2012-Oct-02 19:33 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> At 13:03 -0700 on 01 Oct (1349096617), Dan Magenheimer wrote:
> > [...]
>
> TTBOMK, Xapi actually _has_ solved this problem, even with ballooning and PoD. I don't know if they have any plans to support sharing, swapping or tmem, though.

Is this because PoD never independently increases the size of a domain's allocation? If so, then I agree Xapi has solved the problem, because in all cases the toolstack knows when the amount of memory allocated to a domain is increasing.

However, given that George's 4.3 plan contains:

* Memory: Replace PoD with paging mechanism
  owner: george@citrix
  status: May need review

xapi might want to (re)consider either the above 4.3 feature or see that this problem has been properly fixed prior to 4.3, because I am fairly sure that paging _will_ increase a domain's current allocation without knowledge of the toolstack.

> Adding a 'reservation' of free pages that may only be allocated by a given domain should be straightforward enough, but I'm not sure it helps much.

It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.

> In the 'balloon-to-fit' model where all memory is already allocated to some domain (or tmem), some part of the toolstack needs to sort out freeing up the memory before allocating it to another VM.

By balloon-to-fit, do you mean that all RAM is occupied? Tmem handles the "sort out freeing up the memory" entirely in the hypervisor, so the toolstack never knows.

> Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.

Not sure I understand this, but it seems feasible.

Dan
Tim Deegan
2012-Oct-02 20:16 UTC
Re: domain creation vs querying free memory (xend and xl)
At 12:33 -0700 on 02 Oct (1349181195), Dan Magenheimer wrote:
> Is this because PoD never independently increases the size of a domain's allocation?

AIUI xapi uses the domains' maximum allocations, centrally controlled, to place an upper bound on the amount of guest memory that can be in use. Within those limits there can be ballooning activity. But TBH I don't know the details.

> It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.

But not past the domains' maximum allowance, right? That's not the case with paging, anyway.

> By balloon-to-fit, do you mean that all RAM is occupied? Tmem handles the "sort out freeing up the memory" entirely in the hypervisor, so the toolstack never knows.

Does tmem replace ballooning/sharing/swapping entirely? I thought they could coexist. Or, if you just mean that tmem owns all otherwise-free memory and will relinquish it on demand, then the same problems occur while the toolstack is moving memory from owned-by-guests to owned-by-tmem.

> > Surely that component needs to handle the exclusion too - otherwise a series of small VM creations could stall a large one indefinitely.
>
> Not sure I understand this, but it seems feasible.

If you ask for a large VM and a small VM to be started at about the same time, the small VM will always win (since you'll free enough memory for the small VM before you free enough for the big one). If you then ask for another small VM it will win again, and so forth, indefinitely postponing the large VM in the waiting-for-memory state, unless some agent explicitly enforces that VMs be started in order. If you have such an agent you probably don't need a hypervisor interlock as well.

I think it would be better to back up a bit. Maybe you could sketch out how you think [lib]xl ought to be handling ballooning/swapping/sharing/tmem when it's starting VMs. I don't have a strong objection to accounting free memory to particular domains if it turns out to be useful, but as always I prefer not to have things happen in the hypervisor if they could happen in less privileged code.

Tim.
Dan Magenheimer
2012-Oct-02 21:56 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> AIUI xapi uses the domains' maximum allocations, centrally controlled, to place an upper bound on the amount of guest memory that can be in use. Within those limits there can be ballooning activity. But TBH I don't know the details.

Yes, that's the same as saying there is no memory-overcommit. The original problem occurs only if there are multiple threads of execution that can be simultaneously asking the hypervisor to allocate memory without the knowledge of a single centralized "controller".

> > It absolutely does help. With tmem (and I think with paging), the total allocation of a domain may be increased without knowledge by the toolset.
>
> But not past the domains' maximum allowance, right? That's not the case with paging, anyway.

Right. We can quibble about memory hot-add, depending on its design.

> Does tmem replace ballooning/sharing/swapping entirely? I thought they could coexist. Or, if you just mean that tmem owns all otherwise-free memory and will relinquish it on demand, then the same problems occur while the toolstack is moving memory from owned-by-guests to owned-by-tmem.

Tmem replaces sharing/swapping entirely for guests that support it. Since kernel changes are required to support it, not all guests will ever support it. Now, with full tmem support in the Linux kernel, it is possible that eventually all non-legacy Linux guests will support it. Tmem dynamically handles all the transfer of owned-by memory capacity in the hypervisor, essentially augmenting the page allocator, so the hypervisor is the "controller".

Oh, and tmem doesn't replace ballooning at all... it works best with selfballooning (which is also now in the Linux kernel). Ballooning is still a useful mechanism for moving memory capacity between the guest and the hypervisor; tmem caches data and handles policy.

> If you ask for a large VM and a small VM to be started at about the same time, the small VM will always win (since you'll free enough memory for the small VM before you free enough for the big one). If you then ask for another small VM it will win again, and so forth, indefinitely postponing the large VM in the waiting-for-memory state, unless some agent explicitly enforces that VMs be started in order. If you have such an agent you probably don't need a hypervisor interlock as well.

OK, I see, thanks.

> I think it would be better to back up a bit. Maybe you could sketch out how you think [lib]xl ought to be handling ballooning/swapping/sharing/tmem when it's starting VMs. I don't have a strong objection to accounting free memory to particular domains if it turns out to be useful, but as always I prefer not to have things happen in the hypervisor if they could happen in less privileged code.

I sketched it out earlier in this thread, and will attach it again below. I agree with your last statement in general, but would modify it to "if they could happen efficiently and effectively in less privileged code". Obviously everything that Xen does can be done in less privileged code... in an emulator. Emulators just don't do it fast enough.

Tmem argues that doing "memory capacity transfers" at a page granularity can only be done efficiently in the hypervisor. This is true for page-sharing when it breaks a "share" also... it can't go ask the toolstack to approve allocation of a new page every time a write to a shared page occurs.

Does that make sense?

So the original problem must be solved if:
1) Domain creation is not serialized
2) Any domain's current memory allocation can be increased without approval of the toolstack.

Problem (1) arose independently and my interest is that it gets solved in a way that (2) can also benefit.

Dan

(rough proposed design re-attached below)

> From: Dan Magenheimer
> Sent: Monday, October 01, 2012 2:04 PM
> :
> :
> Back to design brainstorming:
>
> The way I am thinking about it, the tools need to be involved to the extent that they would need to communicate to the hypervisor the following facts (probably via a new hypercall):
>
> X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)
> X2) The launch of X is complete and I will not be requesting the allocation of any more RAM for it. Please release the reservation, whether or not I've requested a total of N MB.
>
> The calls may be nested or partially ordered, i.e.
> X1...Y1...Y2...X2
> X1...Y1...X2...Y2
> and the hypervisor must be able to deal with this.
>
> Then there would need to be two "versions" of "xm/xl free". We can quibble about which should be the default, but they would be:
>
> - "xl --reserved free" asks the hypervisor how much RAM is available taking into account reservations
> - "xl --raw free" asks the hypervisor for the instantaneous amount of RAM unallocated, not counting reservations
>
> When the tools are not launching a domain (that is, there has been a matching X2 for every X1), the results of the above "free" queries are always identical.
>
> So, IanJ, does this match up with the design you were thinking about?
>
> Thanks,
> Dan
>
> [1] I think the core culprits are (a) the hypervisor accounts for memory allocation of pages strictly on a first-come-first-served basis and (b) the tools don't have any form of need-this-much-memory "transaction" model
Tim Deegan
2012-Oct-04 10:06 UTC
Re: domain creation vs querying free memory (xend and xl)
At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote:
> Yes, that's the same as saying there is no memory-overcommit.

I'd say there is - but it's all done by ballooning, and it's centrally enforced by lowering each domain's maxmem to its balloon target, so a badly behaved guest can't balloon up and confuse things.

> The original problem occurs only if there are multiple threads of execution that can be simultaneously asking the hypervisor to allocate memory without the knowledge of a single centralized "controller".

Absolutely.

> Tmem argues that doing "memory capacity transfers" at a page granularity can only be done efficiently in the hypervisor. This is true for page-sharing when it breaks a "share" also... it can't go ask the toolstack to approve allocation of a new page every time a write to a shared page occurs.
>
> Does that make sense?

Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

> (rough proposed design re-attached below)

Thanks for that. It describes a sensible-looking hypervisor interface, but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
- decide whether a VM can be started at all;
- control those four systems to shuffle memory around; and
- resolve races sensibly to avoid small VMs deferring large ones.
(AIUI, xl already has some logic to handle the case of balloon-to-fit.)

The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

Cheers,

Tim.
Ian Campbell
2012-Oct-04 10:17 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote:
> but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
> - decide whether a VM can be started at all;
> - control those four systems to shuffle memory around; and
> - resolve races sensibly to avoid small VMs deferring large ones.
> (AIUI, xl already has some logic to handle the case of balloon-to-fit.)
>
> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html

I'm sure there was earlier discussion which led to that, but I can't seem to see it in the archives right now, perhaps I'm not looking for the right Subject.

Olaf might have been intending to look into this (I can't quite remember where we left it)

Ian.
Andres Lagar-Cavilla
2012-Oct-04 13:20 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 6:17 AM, Ian Campbell wrote:
> On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote:
>> but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
>> - decide whether a VM can be started at all;
>> - control those four systems to shuffle memory around; and

Are we talking about a per-VM control, with one or more of those sub-systems colluding concurrently? Or are we talking about a global view, and how chunks of host memory get sub-allocated? Hopefully the latter...

>> - resolve races sensibly to avoid small VMs deferring large ones.
>> (AIUI, xl already has some logic to handle the case of balloon-to-fit.)
>>
>> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.
>
> There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
> http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html
>
> I'm sure there was earlier discussion which led to that, but I can't seem to see it in the archives right now, perhaps I'm not looking for the right Subject.

IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl, and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

Andres
Ian Campbell
2012-Oct-04 13:25 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 14:20 +0100, Andres Lagar-Cavilla wrote:
> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl, and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

That's right, I'd forgotten about that conversation. Yet somehow the mail I referenced seems to be a result of that conversation -- which is a nice coincidence ;-)
Andres Lagar-Cavilla
2012-Oct-04 13:33 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote:
> Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM; I don't see how it would altogether avoid it.

If the objective is trying to put a cap on the unpredictable growth of memory allocations via CoW unsharing, two observations: (1) it will never grow past the nominal VM footprint; (2) one can put a cap today by tweaking d->max_pages -- CoW will fail, the faulting vcpu will sleep, and things can be kicked back into action at a later point.

>>> From: Dan Magenheimer
>>> Sent: Monday, October 01, 2012 2:04 PM
>>> [...]
>>> X1) I am launching a domain X and it is eventually going to consume up to a maximum of N MB. Please tell me if there is sufficient RAM available AND, if so, reserve it until I tell you I am done. ("AND" implies transactional semantics)

X1 does not need hypervisor support. We already coexist with a global daemon that is a single point of failure. I'm not arguing for xenstore to hold onto these reservations, but a daemon can. Xapi does it that way.

Andres
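[Picking up Andres's point that X1 could live outside the hypervisor: a minimal sketch -- invented names, not xapi or libxl code, with the host's free-memory figure hard-coded where a real daemon would query the hypervisor (e.g. via xc_physinfo()) -- of a per-host daemon tracking in-flight "claims" and answering the "unreserved free" question from earlier in the thread:]

```c
/* Sketch of the toolstack-daemon variant: reservations tracked in dom0
 * userspace rather than in Xen. Names are invented; this is not xapi code. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_PENDING 64

struct pending { uint32_t domid; uint64_t mb; };

static struct pending pending[MAX_PENDING];
static int npending;

/* Placeholder: a real daemon would ask the hypervisor for this number. */
static uint64_t hypervisor_free_mb(void) { return 1900000; }

static uint64_t pending_mb(void)
{
    uint64_t sum = 0;
    for (int i = 0; i < npending; i++)
        sum += pending[i].mb;
    return sum;
}

/* The "xm unreserved" / "xl --reserved free" number from earlier in the thread. */
static uint64_t unreserved_mb(void)
{
    return hypervisor_free_mb() - pending_mb();
}

/* X1, daemon-side: admit the launch only if unreserved memory suffices. */
static bool claim(uint32_t domid, uint64_t mb)
{
    if (npending == MAX_PENDING || unreserved_mb() < mb)
        return false;
    pending[npending++] = (struct pending){ domid, mb };
    return true;
}

/* X2, daemon-side: the launch finished (or failed); drop the claim. */
static void unclaim(uint32_t domid)
{
    for (int i = 0; i < npending; i++)
        if (pending[i].domid == domid) {
            pending[i] = pending[--npending];
            return;
        }
}

int main(void)
{
    printf("claim dom 1 (1TB): %s\n", claim(1, 1000000) ? "ok" : "refused");
    printf("claim dom 2 (1TB): %s\n", claim(2, 1000000) ? "ok" : "refused");
    unclaim(1);
    return 0;
}
```

[As Dan notes elsewhere in the thread, a purely toolstack-side scheme like this only covers allocations the toolstack itself initiates; growth driven from within the hypervisor (tmem, paging, CoW unsharing) would bypass it.]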
Dan Magenheimer
2012-Oct-04 16:36 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, October 04, 2012 4:07 AM
> To: Dan Magenheimer
> Cc: Olaf Hering; Keir Fraser; Konrad Wilk; George Dunlap; Kurt Hackel; Ian Jackson; xen-devel@lists.xen.org; George Shuklin; Dario Faggioli; Andres Lagar-Cavilla
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)

Hi Tim -- Good discussion!

> I'd say there is - but it's all done by ballooning, and it's centrally enforced by lowering each domain's maxmem to its balloon target, so a badly behaved guest can't balloon up and confuse things.

While I agree this conceivably is a form of memory overcommit, I discarded it as a workable overcommit solution in 2008. The short reason is: EVERY guest is badly behaved, in that they all want to suck up as much memory as possible and they all need it _now_. This observation is actually what led to tmem.

> Yes. The page-sharing version can be handled by having a pool of dedicated memory for breaking shares, and having the toolstack asynchronously replenish that, rather than allowing CoW to use up all memory in the system.

This is really just overcommit-by-undercommit. IMHO, any attempt to set aside a chunk of memory for a specific purpose just increases memory pressure on all the other memory users. Nobody has any clue a priori what the size of that dedicated memory pool should be; if it is too big, you are simply wasting memory, and if it is too small, you haven't solved the real problem. Workloads vary too dramatically, instantaneously, and unpredictably across time in their need for memory. Sharing makes it even more complex.

> Thanks for that. It describes a sensible-looking hypervisor interface, but my question was really: what should xl do, in the presence of ballooning, sharing, paging and tmem, to
> - decide whether a VM can be started at all;
> - control those four systems to shuffle memory around; and
> - resolve races sensibly to avoid small VMs deferring large ones.
>
> The second of those three is the interesting one. It seems to me that if the tools can't force all other actors to give up memory (and not immediately take it back) then they can't guarantee to be able to start a new VM, even with the new reservation hypercalls.

I agree the second one is interesting, but the only real solution is for the controller to be an oracle for all the guests. That makes it less interesting to me, so balloon-to-fit is less interesting to me (even if it is the only overcommit option for legacy guests). IMHO, the problem is the same as for guest OS's that compute pi in the kernel when there are no runnable tasks, i.e. a virtualization environment is sometimes forced to partition resources, not virtualize those guests. IOW, don't overcommit "unenlightened" legacy guests. [1]

So I don't think the design I wrote up solves the second one, nor do I think it makes it any worse. The design I wrote up is intended to solve the first and third. I _think_ the reservation-transaction model described (X1 and X2) should work for libxl, in the presence of ballooning, sharing, paging, and tmem. And it neither helps nor hurts balloon-to-fit.

Given that, can you shoot holes in the design? Or are there parts that aren't clear? Or (admitting that I am a libxl idiot) is it unworkable for xl/libxl?

Thanks,
Dan

[1] By "unenlightened" here, I mean guests that are still under the notion that they "own" all of a fixed amount of RAM. A balloon driver makes them "semi-enlightened" :-)
Dan Magenheimer
2012-Oct-04 16:54 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca]
> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)
>
> > There was a bit of discussion in the spring about this sort of thing (well, three of the four), which seems to have fallen a bit by the wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem), e.g.
> > http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html
>
> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a scheme so that libxl can be told who the "actor" will be for memory management, and then hand off appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by libxl,

Scanning through the archived message I am under the impression that the focus is on a single server... i.e. "punt if actor is not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid stepping on other memory overcommit technologies. That makes it almost orthogonal, I think, to the problem I originally raised.

But a bigger concern is that its focus on a single machine ignores the "cloud", where Xen seems to hold an advantage. In the cloud, the actor is "controlling" _many_ machines. In the problem I originally raised, this actor (a centralized management console) is simply looking for a server that has sufficient memory to house a new domain, and it (or the automation/sysadmin running it) gets unhappy if (xl running on) the server says "yes, there is enough memory" but then later says "oops, I guess there wasn't enough after all".

> and the end result is the ability to start domains with a memory target suitably managed by balloon, xenpaging, tmem, foo, according to the user's wish. With no need to know obscure knobs. To the extent that might be possible.

Am I detecting s[k|c]epticism?

If so, I too am s[k|c]eptical.

Dan
Dan Magenheimer
2012-Oct-04 16:59 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: > > > At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: > >> Tmem argues that doing "memory capacity transfers" at a page granularity > >> can only be done efficiently in the hypervisor. This is true for > >> page-sharing when it breaks a "share" also... it can''t go ask the > >> toolstack to approve allocation of a new page every time a write to a shared > >> page occurs. > >> > >> Does that make sense? > > > > Yes. The page-sharing version can be handled by having a pool of > > dedicated memory for breaking shares, and the toolstack asynchronously > > replenish that, rather than allowing CoW to use up all memory in the > > system. > > That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I don''t > see how it would altogether avoid it.Agreed, so it doesn''t really solve the problem. (See longer reply to Tim.)> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW > unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap today > by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back > into action at a later point.But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? Any significant memory pressure due to overcommit leads to double-swapping, which leads to horrible performance?
Andres Lagar-Cavilla
2012-Oct-04 17:00 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 12:54 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 6:17 AM, Ian Campbell wrote: >> >>> On Thu, 2012-10-04 at 11:06 +0100, Tim Deegan wrote: >>>> but my question was really: what should xl do, in the presence of >>>> ballooning, sharing, paging and tmem, to >>>> - decide whether a VM can be started at all; >>>> - control those four systems to shuffle memory around; and >> >> Are we talking about a per-VM control, with one or more of those sub-systems colluding concurrently? >> Or are we talking about a global view, and how chunks of host memory get sub-allocated? Hopefully the >> latter... >> >>>> - resolve races sensibly to avoid small VMs deferring large ones. >>>> (AIUI, xl already has some logic to handle the case of balloon-to-fit.) >>>> >>>> The second of those three is the interesting one. It seems to me that >>>> if the tools can''t force all other actors to give up memory (and not >>>> immediately take it back) then they can''t guarantee to be able to start >>>> a new VM, even with the new reservation hypercalls. >>> >>> There was a bit of discussion in the spring about this sort of thing >>> (well, three of the four), which seems to have fallen a bit by the >>> wayside^W^W^W^W^W^Wbeen deferred until 4.3 (ahem) e.g. >>> http://lists.xen.org/archives/html/xen-devel/2012-03/msg01181.html >>> >>> I''m sure there was earlier discussion which led to that, but I can''t >>> seem to see it in the archives right now, perhaps I''m not looking for >>> the right Subject. >> >> IIRC, we had a bit of that conversation during the Santa Clara hackathon. The idea was to devise a >> scheme so that libxl can be told who the "actor" will be for memory management, and then hand-off >> appropriately. Add xl bindings, suitable defaults, and an implementation of the "balloon actor" by > > Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies. That makes it > almost orthogonal, I think, to the problem I originally raised.Yeah, fairly orthogonal.> > But a bigger concern is that its focus on a single machine ignores > the "cloud", where Xen seems to hold an advantage. In the cloud, > the actor is "controlling" _many_ machines. In the problem I > originally raised, this actor (a centralized management console) > is simply looking for a server that has sufficient memory to house > a new domain, and it (or the automation/sysadmin running it) gets > unhappy if (xl running on) the server says "yes there is enough > memory" but then later says, "oops, I guess there wasn''t enough > after all".Big problem in itself, but not one for xen.org (yet, cart before horse). Have you had a look at the Openstack FilterScheduler? Plenty of room for contribution.> >> libxl, and the end result is the ability to start domains with a memory target suitably managed by >> balloon, xenpaging, tmem, foo, according to the user''s wish. With no need to know obscure knobs. To >> the extent that might be possible. > > Am I detecting s[k|c]epticism? > > If so, I too am s[k|c]eptical.Well, not really. Things have to coexist cleanly, to the extent feasible. 
Devising a libxl protocol to perform a clean hand-off if required, and to expose minimum complexity to the average Joe, is a great idea IMHO. Andres> > Dan
Andres Lagar-Cavilla
2012-Oct-04 17:08 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: >> >>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: >>>> Tmem argues that doing "memory capacity transfers" at a page granularity >>>> can only be done efficiently in the hypervisor. This is true for >>>> page-sharing when it breaks a "share" also... it can''t go ask the >>>> toolstack to approve allocation of a new page every time a write to a shared >>>> page occurs. >>>> >>>> Does that make sense? >>> >>> Yes. The page-sharing version can be handled by having a pool of >>> dedicated memory for breaking shares, and the toolstack asynchronously >>> replenish that, rather than allowing CoW to use up all memory in the >>> system. >> >> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I don''t >> see how it would altogether avoid it. > > Agreed, so it doesn''t really solve the problem. (See longer reply > to Tim.) > >> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW >> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap today >> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back >> into action at a later point. > > But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? > Any significant memory pressure due to overcommit leads to double-swapping, > which leads to horrible performance?The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion either. (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. It''s really up to a higher level controller whether you accept allocations and later reach a point of thrashing. I understand this is partly where your discussion is headed, but certainly fixing the primary issue of nominal vanilla allocations preempting each other looks fairly critical to begin with. Andres
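For reference, the capping knob being described here can already be driven from the toolstack through libxc. A minimal sketch follows (error handling omitted; the policy of when to lower and raise the cap is entirely up to the caller):

    #include <xenctrl.h>

    /* Sketch: temporarily clamp d->max_pages so that CoW unsharing (or any
     * other allocation) cannot grow the domain, then lift the clamp once
     * corrective measures have been taken. */
    void clamp_domain(uint32_t domid, uint64_t low_kb, uint64_t normal_kb)
    {
        xc_interface *xch = xc_interface_open(NULL, NULL, 0);

        /* Lower the cap: allocations beyond it will fail, so a CoW unshare
         * would leave the faulting vcpu asleep until the cap is raised. */
        xc_domain_setmaxmem(xch, domid, low_kb);

        /* ... take corrective measures: page out, balloon something, ... */

        /* Raise the cap again and let the domain make progress. */
        xc_domain_setmaxmem(xch, domid, normal_kb);

        xc_interface_close(xch);
    }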
Dan Magenheimer
2012-Oct-04 17:18 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: > > >> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > >> > >> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: > >> > >>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: > >>>> Tmem argues that doing "memory capacity transfers" at a page granularity > >>>> can only be done efficiently in the hypervisor. This is true for > >>>> page-sharing when it breaks a "share" also... it can''t go ask the > >>>> toolstack to approve allocation of a new page every time a write to a shared > >>>> page occurs. > >>>> > >>>> Does that make sense? > >>> > >>> Yes. The page-sharing version can be handled by having a pool of > >>> dedicated memory for breaking shares, and the toolstack asynchronously > >>> replenish that, rather than allowing CoW to use up all memory in the > >>> system. > >> > >> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I > don''t > >> see how it would altogether avoid it. > > > > Agreed, so it doesn''t really solve the problem. (See longer reply > > to Tim.) > > > >> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW > >> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap > today > >> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back > >> into action at a later point. > > > > But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? > > Any significant memory pressure due to overcommit leads to double-swapping, > > which leads to horrible performance? > > The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed > scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. > > I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion > either.Sorry, I wasn''t intending that to be snarky, but on re-read I guess it did sound snarky. What I meant is: Is this just a manual version of what VMware does automatically? Or is there something I am misunderstanding? (I think you answered that below.)> (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised > again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. > It''s really up to a higher level controller whether you accept allocations and later reach a point of > thrashing. > > I understand this is partly where your discussion is headed, but certainly fixing the primary issue of > nominal vanilla allocations preempting each other looks fairly critical to begin with.OK. I _think_ the design I proposed helps in systems that are using page-sharing/host-swapping as well... I assume share-breaking just calls the normal hypervisor allocator interface to allocate a new page (if available)? If you could review and comment on the design from a page-sharing/host-swapping perspective, I would appreciate it. Thanks, Dan
Andres Lagar-Cavilla
2012-Oct-04 17:30 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: >> >>>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >>>> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >>>> >>>> On Oct 4, 2012, at 6:06 AM, Tim Deegan wrote: >>>> >>>>> At 14:56 -0700 on 02 Oct (1349189817), Dan Magenheimer wrote: >>>>>> Tmem argues that doing "memory capacity transfers" at a page granularity >>>>>> can only be done efficiently in the hypervisor. This is true for >>>>>> page-sharing when it breaks a "share" also... it can''t go ask the >>>>>> toolstack to approve allocation of a new page every time a write to a shared >>>>>> page occurs. >>>>>> >>>>>> Does that make sense? >>>>> >>>>> Yes. The page-sharing version can be handled by having a pool of >>>>> dedicated memory for breaking shares, and the toolstack asynchronously >>>>> replenish that, rather than allowing CoW to use up all memory in the >>>>> system. >>>> >>>> That is doable. One benefit is that it would minimize the chance of a VM hitting a CoW ENOMEM. I >> don''t >>>> see how it would altogether avoid it. >>> >>> Agreed, so it doesn''t really solve the problem. (See longer reply >>> to Tim.) >>> >>>> If the objective is trying to put a cap to the unpredictable growth of memory allocations via CoW >>>> unsharing, two observations: (1) will never grow past nominal VM footprint (2) One can put a cap >> today >>>> by tweaking d->max_pages -- CoW will fail, faulting vcpu will sleep, and things can be kicked back >>>> into action at a later point. >>> >>> But IIRC isn''t it (2) that has given VMware memory overcommit a bad name? >>> Any significant memory pressure due to overcommit leads to double-swapping, >>> which leads to horrible performance? >> >> The little that I''ve been able to read from their published results is that a "lot" of CPU is consumed >> scanning memory and fingerprinting, which leads to a massive assault on micro-architectural caches. >> >> I don''t know if that equates to a "bad name", but I don''t think that is a productive discussion >> either. > > Sorry, I wasn''t intending that to be snarky, but on re-read I guess it > did sound snarky. What I meant is: Is this just a manual version of what > VMware does automatically? Or is there something I am misunderstanding? > (I think you answered that below.) > >> (2) doesn''t mean swapping. Note that d->max_pages can be set artificially low by an admin, raised >> again. etc. It''s just a mechanism to keep a VM at bay while corrective measures of any kind are taken. >> It''s really up to a higher level controller whether you accept allocations and later reach a point of >> thrashing. >> >> I understand this is partly where your discussion is headed, but certainly fixing the primary issue of >> nominal vanilla allocations preempting each other looks fairly critical to begin with. > > OK. I _think_ the design I proposed helps in systems that are using > page-sharing/host-swapping as well... I assume share-breaking just > calls the normal hypervisor allocator interface to allocate a > new page (if available)? If you could review and comment on > the design from a page-sharing/host-swapping perspective, I would > appreciate it.I think you will need to refine your notion of reservation. 
If you have nominal RAM N, and current RAM C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing or whatever -- then you are not over-committing memory.

To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee that.

Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to stick to the watermark -- if something swaps in, something else swaps out. And uncooperative balloons can be stymied by xapi using d->max_pages.

This is why I believe you need to solve the problem of initial reservation, and the problem of handing off to the right actor. And then xl need not care any further.

Andres> > Thanks, > Dan
Dan Magenheimer
2012-Oct-04 17:55 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote: > > >> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > >> > >> > >> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: > >> > > OK. I _think_ the design I proposed helps in systems that are using > > page-sharing/host-swapping as well... I assume share-breaking just > > calls the normal hypervisor allocator interface to allocate a > > new page (if available)? If you could review and comment on > > the design from a page-sharing/host-swapping perspective, I would > > appreciate it. > > I think you will need to refine your notion of reservation. If you have nominal RAM N, and current RAM > C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing > or whatever -- then you are not over-committing memory. > > To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee > that. > > Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to > stick to the watermark -- if something swaps in something else swaps out. And uncooperative balloons > are be stymied by xapi using d->max_pages. > > This is why I believe you need to solve the problem of initial reservation, and the problem of handing > off to the right actor. And then xl need not care any further. > > AndresI think we may be saying the same thing, at least in the context of the issue I am trying to solve (which, admittedly, may be a smaller part of a bigger issue, and we should attempt to ensure that the solution to the smaller part is at least a step in the right direction for the bigger issue). And I am trying to solve the mechanism problem only, not the policy which, I agree is mm-specific. The core problem, as I see it, is that there are multiple consumers of memory, some of which may be visible to xl and some of which are not. Ultimately, the hypervisor is asked to provide memory and will return failure if it can''t, so the hypervisor is the final arbiter. When a domain is created, we''d like to ensure there is enough memory for it to "not fail". But when the toolstack asks for memory to create a domain, it asks for it "piecemeal". I''ll assume that the toolstack knows how much memory it needs to allocate to ensure the launch doesn''t fail... my solution is that it asks for that entire amount of memory at once as a "reservation". If the hypervisor has that much memory available, it returns success and must behave as if the memory has been already allocated. Then, later, when the toolstack is happy that the domain did successfully launch, it says "remember that reservation? any memory reserved that has not yet been allocated, need no longer be reserved, you can unreserve it" In other words, between reservation and unreserve, there is no memory overcommit for that domain. Once the toolstack does the unreserve, its memory is available for overcommit mechanisms. Not sure if that part was clear: it''s my intent that unreserve occur soon after the domain is launched, _not_, for example, when the domain is shut down. What I don''t know is if there is a suitable point in the launch when the toolstack knows it can do the "release"... that may be the sticking point and may be mm-specific. Thanks, Dan
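As a sketch of the intended ordering from the toolstack side (the reserve/unreserve calls are the hypothetical ones sketched earlier in the thread, and build_and_launch() stands in for the existing piecemeal build path -- none of this is current libxl code):

    #include <xenctrl.h>

    /* Hypothetical wrappers, not real libxc calls. */
    int xc_domain_reserve_memory(xc_interface *xch, uint32_t domid, uint64_t max_kb);
    int xc_domain_unreserve_memory(xc_interface *xch, uint32_t domid);
    /* Placeholder for the existing domain build path. */
    int build_and_launch(xc_interface *xch, uint32_t domid);

    int create_with_reservation(xc_interface *xch, uint32_t domid,
                                uint64_t launch_kb)
    {
        int rc;

        /* Reserve the whole launch footprint up front, transactionally.
         * If this fails we fail fast, before building anything. */
        rc = xc_domain_reserve_memory(xch, domid, launch_kb);
        if (rc)
            return rc;

        /* The usual piecemeal allocations now draw down the reservation
         * instead of racing against other consumers of free memory. */
        rc = build_and_launch(xch, domid);

        /* Unreserve soon after launch, success or failure: anything reserved
         * but never allocated becomes available again, including to the
         * overcommit mechanisms. */
        xc_domain_unreserve_memory(xch, domid);

        return rc;
    }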
Olaf Hering
2012-Oct-04 18:26 UTC
Re: domain creation vs querying free memory (xend and xl)
On Mon, Oct 01, Dan Magenheimer wrote:> > From: Ian Jackson [mailto:Ian.Jackson@eu.citrix.com] > > Sent: Friday, September 28, 2012 11:12 AM > > To: Dan Magenheimer > > Cc: xen-devel@lists.xen.org; Kurt Hackel; Konrad Wilk > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > Dan Magenheimer writes ("[Xen-devel] domain creation vs querying free memory (xend and xl)"): > > > But the second domain launch fails, possibly after > > > several minutes because, actually, there isn''t enough > > > physical RAM for both. > > > > This is a real problem. The solution is not easy, and may not make it > > for 4.3. It would involve a rework of the memory handling code in > > libxl. > > [broadening cc to "Xen memory technology people", please forward/add > if I missed someone]Dan, I''m sure there has been already alot of thought and discussion about this issue, So here are my thoughts: In my opinion the code which is about to start a domain has to take all currently created/starting/running/dying domains, and their individual "allocation behaviour", into account before it can finally launch the domain. All of this needs math, not locking. A domain (domU or dom0) has a couple of constraints: - current nr_pages vs. target_nr_pages vs. max_pages - current PoD allocation vs. max_PoD - current paged_pages vs. target_nr_pages vs. max_paged_pages - some shared_pages - some tmem - maybe grant_pages - ... Depending on the state (starting and working towards a target number, running, dying) the "current" numbers above will increase or shrink. So the algorithm which turns the parameters above for each domain into a total number of allocated (or soon to be allocated) host memory has to work with "target numbers" instead of what is currently allocated. Some examples that come to mind: - a PoD domain will most likely use all of the pages configured with memory=, so that number should be used - the number shared pages is eventually not predictable. If so, this number could be handled as "allocated to the guest". Maybe a knob to say "running domains will have amount N shared" can exist? Dont know much about how sharing looks in practice. - ballooning may not reach the configured target, and the guest admin can just balloon up to the limit without notifying the toolstack - a new paging target will take some time until its reached, there is always some jitter during page-in/page-out, mapping guest pages will cause nomination failures. - tmem does something, I dont know. - no idea if grant pages are needed in the math Since the central management of xend is gone each libxl process is likely on its own, so two "xl create" can race when doing the math. Maybe a libxl process dies and leaves a mess behind. So that could make it difficult to get a good snapshot of the memory situation on the host. Maybe each domain could get some metadata to record the individual current/target/max numbers. Or if xenstore is good enough, something can cleanup zombie numbers. As IanJ said, the memory handling code in libxl needs such a feature to do the math right. The proposed handling of sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of it. Olaf
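A rough sketch of the kind of math this implies, working from per-domain targets rather than instantaneous allocations (the field names are illustrative, not the real libxl or libxc structures):

    #include <stdint.h>

    struct dom_mem_info {
        uint64_t current_kb;     /* what is allocated right now            */
        uint64_t target_kb;      /* memory= / ballooning / paging target   */
        uint64_t shared_kb_est;  /* allowance for pages that may unshare   */
    };

    /* Count each domain at the larger of its current and target size, plus
     * an allowance for unsharing; a PoD domain will most likely grow to its
     * configured memory=, so its target covers that case as well. */
    uint64_t host_kb_committed(const struct dom_mem_info *doms, int ndoms)
    {
        uint64_t committed = 0;
        for (int i = 0; i < ndoms; i++) {
            uint64_t claim = doms[i].current_kb > doms[i].target_kb
                           ? doms[i].current_kb : doms[i].target_kb;
            committed += claim + doms[i].shared_kb_est;
        }
        return committed;
    }

    /* A new guest "fits" if committed + its own target <= host total --
     * modulo the jitter and races discussed in this thread. */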
Dan Magenheimer
2012-Oct-04 19:38 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Olaf Hering [mailto:olaf@aepfle.de] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Mon, Oct 01, Dan Magenheimer wrote: >Hi Olaf -- Thanks for the reply.> domain. All of this needs math, not locking. > : > As IanJ said, the memory handling code in libxl needs such a feature to > do the math right. The proposed handling of > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > it.Unfortunately, as you observe in some of the cases earlier in your reply, it is more than a math problem for libxl... it is a crystal ball problem. If xl launches a domain D at time T and it takes N seconds before it has completed asking the hypervisor for all of the memory M that D will require to successfully launch, then xl must determine at time T the maximum memory allocated across all running domains for the future time period between T and T+N. In other words, xl must predict the future. Clearly this is impossible especially when page-sharing is not communicating its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem is not communicating allocations resulting from multiple domains simultaenously making tmem hypercalls to libxl, and PoD is not communicating its allocations to libxl, and in-guest-kernel selfballooning is not communicating allocations to libxl. Only the hypervisor is aware of every dynamic allocation request. So all libxl can do is guess about the future because races are going to occur. Multiple threads are simultaneously trying to access a limited resource (pages of memory) and only the hypervisor knows whether there is enough to deliver memory for all requests. To me, the solution to racing for a shared resource is locking. Naturally, you want the critical path to be as short as possible. And you don''t want to lock all instances of the resource (i.e. every page in memory) if you can avoid it. And you need to ensure that the lock is honored for all requests to allocate the shared resource, meaning in this case that it has to be done in the hypervisor. I think that''s what the proposed design does: It provides a mechanism to ask the hypervisor to reserve a fixed amount of memory M, some or all of which will eventually turn into an allocation request; and a mechanism to ask the hypervisor to no longer honor that reservation ("unreserve") whether or not all of M has been allocated. It essentially locks that M amount of memory between reserve and unreserve so that other dynamic allocations (page-sharing, tmem, PoD, OR another libxl thread trying to create another domain) cannot sneak in and claim memory capacity that has been reserved. Does that make sense? Thanks, Dan
Olaf Hering
2012-Oct-04 20:18 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, Oct 04, Dan Magenheimer wrote:> > From: Olaf Hering [mailto:olaf@aepfle.de] > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > On Mon, Oct 01, Dan Magenheimer wrote: > > > > Hi Olaf -- > > Thanks for the reply. > > > domain. All of this needs math, not locking. > > : > > As IanJ said, the memory handling code in libxl needs such a feature to > > do the math right. The proposed handling of > > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > > it. > > Unfortunately, as you observe in some of the cases earlier in your reply, > it is more than a math problem for libxl... it is a crystal ball problem. > If xl launches a domain D at time T and it takes N seconds before it has > completed asking the hypervisor for all of the memory M that D will require > to successfully launch, then xl must determine at time T the maximum memory > allocated across all running domains for the future time period between > T and T+N. In other words, xl must predict the future.I think xl can predict it, if it takes the target of all domains into account. Certainly not down to a handful pages, it would be good enough to know if the calculated estimate of free memory is good for the new guest and its specific memory targets.> Clearly this is impossible especially when page-sharing is not communicating > its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem > is not communicating allocations resulting from multiple domains > simultaenously making tmem hypercalls to libxl, and PoD is not communicating > its allocations to libxl, and in-guest-kernel selfballooning is not communicating > allocations to libxl. Only the hypervisor is aware of every dynamic allocation > request.The hypervisor can not predict the future either, and it has even less info about the individual targets of each domain.> Does that make sense?It does, but: If xl reserves the memory in its own "virtual allocator", or if Xen gets such functionality, does not really matter, as long as its known how much exactly needs to be allocated. I think that part is missing. Olaf
Dan Magenheimer
2012-Oct-04 20:35 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Olaf Hering [mailto:olaf@aepfle.de] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Thu, Oct 04, Dan Magenheimer wrote: > > > > From: Olaf Hering [mailto:olaf@aepfle.de] > > > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > > > > > On Mon, Oct 01, Dan Magenheimer wrote: > > > > > > > Hi Olaf -- > > > > Thanks for the reply. > > > > > domain. All of this needs math, not locking. > > > : > > > As IanJ said, the memory handling code in libxl needs such a feature to > > > do the math right. The proposed handling of > > > sharing/paging/ballooning/PoD/tmem/... in libxl is just a small part of > > > it. > > > > Unfortunately, as you observe in some of the cases earlier in your reply, > > it is more than a math problem for libxl... it is a crystal ball problem. > > If xl launches a domain D at time T and it takes N seconds before it has > > completed asking the hypervisor for all of the memory M that D will require > > to successfully launch, then xl must determine at time T the maximum memory > > allocated across all running domains for the future time period between > > T and T+N. In other words, xl must predict the future. > > I think xl can predict it, if it takes the target of all domains into > account. Certainly not down to a handful pages, it would be good enough > to know if the calculated estimate of free memory is good for the new > guest and its specific memory targets.Well I don''t know enough about the page-sharing implementation but it''s not hard with tmem to synthesize a workload where the amount of free memory is half of RAM at time T and there is no RAM left at all at time T+(N/2) and three quarters of RAM is free at time T+N. That would be very hard for xl to predict. I expect that dramatic changes like this might be harder with page-sharing but not impossible.> > Clearly this is impossible especially when page-sharing is not communicating > > its dynamic allocations (e.g. due to page-splitting) to libxl, and tmem > > is not communicating allocations resulting from multiple domains > > simultaenously making tmem hypercalls to libxl, and PoD is not communicating > > its allocations to libxl, and in-guest-kernel selfballooning is not communicating > > allocations to libxl. Only the hypervisor is aware of every dynamic allocation > > request. > > The hypervisor can not predict the future either, and it has even less > info about the individual targets of each domain.The point is the hypervisor doesn''t need to predict the future and doesn''t need to know the individual targets. It just acts on allocation requests and, with the proposed design, on reservation requests.> > Does that make sense? > > It does, but: > If xl reserves the memory in its own "virtual allocator", or if Xen gets > such functionality, does not really matter, as long as its known how much > exactly needs to be allocated. I think that part is missing.I agree, though I think the only constraint is that the domain must be capable of booting. So if xl always requests a reservation of "mem=", I would think that should always work.
Ian Campbell
2012-Oct-05 09:44 UTC
Re: domain creation vs querying free memory (xend and xl)
On Thu, 2012-10-04 at 17:54 +0100, Dan Magenheimer wrote:> Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies.

xl is inherently a single-system toolstack, and a simple ballooning-based actor would just be its default. The design is not intended to require that a toolstack only provide a single actor, or indeed that the actor is provided by the toolstack at all. It would be perfectly reasonable for xl to provide actors which work well with tmem or paging or sharing or some complex combination, and even to select them by default when those technologies are enabled on the host. We also fully expect that other toolstacks will want to provide their own actors which make use of the facilities of those toolstacks to do a better job based on the additional state, etc. (e.g. we expect xapi to want to provide a squeezed-based actor). Lastly, the design is also intended to support "3rd party" actors which are not part of any toolstack. E.g. actors which talk to your cloud orchestration layer, or to some central authority, or which communicate with other hosts, etc., are all intended to be a possibility.

> That makes it > almost orthogonal, I think, to the problem I originally raised. > > But a bigger concern is that its focus on a single machine ignores > the "cloud", where Xen seems to hold an advantage. In the cloud, > the actor is "controlling" _many_ machines. In the problem I > originally raised, this actor (a centralized management console) > is simply looking for a server that has sufficient memory to house > a new domain, and it (or the automation/sysadmin running it) gets > unhappy if (xl running on) the server says "yes there is enough > memory" but then later says, "oops, I guess there wasn''t enough > after all".

Integrating some sort of "entry control" into the actor protocol seems like a logical addition to me (assuming we didn''t already include it, I didn''t go back and check), since the details of when to say yes or no seem like they would depend very much on the policies of that particular actor and the technologies which it is using to implement them.

Ian.
George Dunlap
2012-Oct-05 11:40 UTC
Re: domain creation vs querying free memory (xend and xl)
On 04/10/12 17:54, Dan Magenheimer wrote:>> > Scanning through the archived message I am under the impression > that the focus is on a single server... i.e. "punt if actor is > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > stepping on other memory overcommit technologies. That makes it > almost orthogonal, I think, to the problem I originally raised.No, the idea was to allow the flexibility of different actors in different situations. The plan was to start with a simple actor, but to add new ones as necessary. But on reflection, it seems like the whole "actor" thing was actually something completely separate to what we''re talking about here. The idea behind the actor (IIRC) was that you could tell the toolstack, "Make VM A use X amount of host memory"; and the actor would determine the best way to do that -- either by only ballooning, or ballooning first and then swapping. But it doesn''t decide how to get the value X. This thread has been very hard to follow for some reason, so let me see if I can understand everything: * You are concerned about being able to predictably start VMs in the face of: - concurrent requests, and - dynamic memory technologies (including PoD, ballooning, paging, page sharing, and tmem) Any of which may change the amount of free memory between the time a decision is made and the time memory is actually allocated. * You have proposed a hypervisor-based solution that allows the toolstack to "reserve" a specific amount of memory to a VM that will not be used for something else; this allocation is transactional -- it will either completely succeed, or completely fail, and do it quickly. Is that correct? The problem with that solution, it seems to me, is that the hypervisor does not (and I think probably should not) have any insight into the policy for allocating or freeing memory as a result of other activities, such as ballooning or page sharing. Suppose someone were ballooning down domain M to get 8GiB in order to start domain A; and at some point , another process looks and says, "Oh look, there''s 4GiB free, that''s enough to start domain B" and asks Xen to reserve that memory. Xen has no way of knowing that the memory freed by domain M was "earmarked" for domain A, and so will happily give it to domain B, causing domain A''s creation to fail (potentially). So it seems like we need to have the idea of a memory controller -- one central process (per host, as you say) that would know about all of the knobs -- ballooning, paging, page sharing, tmem, whatever -- that could be in charge of knowing where all the memory was coming from and where it was going. So if xl wanted to start a new VM, it can ask the memory controller for 3GiB, and the controller could decide, "I''ll take 1GiB from domain M and 2 from domain N, and give it to the new domain", and respond when it has the memory that it needs. Similarly, it can know that it should try to keep X megabytes for un-sharing of pages, and it can be responsible for freeing up more memory if that memory becomes exhausted. At the moment, the administrator himself (or the cloud orchestration layer) needs to be his own memory controller; that is, he needs to manually decide if there''s enough free memory to start a VM; if there''s not, he needs to figure out how to get that memory (either by ballooning or swapping). Ballooning and swapping are both totally under his control; the only thing he doesn''t control is the unsharing of pages. 
But as long as there was a way to tell the page-sharing daemon to leave a certain amount of free memory untouched (i.e. not consume it for un-sharing), then this "administrator-as-memory-controller" should work just fine. Does that make sense? Or am I still confused? :-) -George
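Purely as an illustration of the decision such a controller would be making (nothing like this exists in xl today, and every helper name below is invented for the sketch):

    #include <stdint.h>

    struct dom_view { uint32_t domid; uint64_t current_kb; uint64_t min_kb; };

    /* Invented helpers: how much host memory is currently free, and how to
     * ask a domain to balloon to a new target. */
    uint64_t host_free_kb(void);
    void set_balloon_target(uint32_t domid, uint64_t target_kb);

    /* Illustrative only: free up need_kb of host memory by ballooning down
     * other domains, the way the hypothetical controller above would. */
    int controller_free_up(struct dom_view *doms, int ndoms, uint64_t need_kb)
    {
        uint64_t free_kb = host_free_kb();
        if (free_kb >= need_kb)
            return 0;

        uint64_t still_needed = need_kb - free_kb;
        for (int i = 0; i < ndoms && still_needed > 0; i++) {
            uint64_t spare = doms[i].current_kb - doms[i].min_kb;
            uint64_t take  = spare < still_needed ? spare : still_needed;
            set_balloon_target(doms[i].domid, doms[i].current_kb - take);
            still_needed -= take;
        }

        /* The controller must still wait for the balloon drivers to comply,
         * and must "earmark" the freed memory so that a concurrent request
         * cannot take it first -- which is exactly the race being discussed
         * in this thread. */
        return still_needed == 0 ? 0 : -1;
    }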
Andres Lagar-Cavilla
2012-Oct-05 14:25 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 4, 2012, at 1:55 PM, Dan Magenheimer wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >> >> >> On Oct 4, 2012, at 1:18 PM, Dan Magenheimer wrote: >> >>>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >>>> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) >>>> >>>> >>>> On Oct 4, 2012, at 12:59 PM, Dan Magenheimer wrote: >>>> >>> OK. I _think_ the design I proposed helps in systems that are using >>> page-sharing/host-swapping as well... I assume share-breaking just >>> calls the normal hypervisor allocator interface to allocate a >>> new page (if available)? If you could review and comment on >>> the design from a page-sharing/host-swapping perspective, I would >>> appreciate it. >> >> I think you will need to refine your notion of reservation. If you have nominal RAM N, and current RAM >> C, N >= C, it makes no sense to reserve N so the VM later has room to occupy by swapping-in, unsharing >> or whatever -- then you are not over-committing memory. >> >> To the extent that you want to facilitate VM creation, it does make sense to reserve C and guarantee >> that. >> >> Then it gets mm-specific. PoD has one way of dealing with the allocation growth. xenpaging tries to >> stick to the watermark -- if something swaps in something else swaps out. And uncooperative balloons >> are be stymied by xapi using d->max_pages. >> >> This is why I believe you need to solve the problem of initial reservation, and the problem of handing >> off to the right actor. And then xl need not care any further. >> >> Andres > > I think we may be saying the same thing, at least in the context > of the issue I am trying to solve (which, admittedly, may be > a smaller part of a bigger issue, and we should attempt to ensure > that the solution to the smaller part is at least a step in the > right direction for the bigger issue). And I am trying to > solve the mechanism problem only, not the policy which, I agree is > mm-specific. > > The core problem, as I see it, is that there are multiple consumers of > memory, some of which may be visible to xl and some of which are > not. Ultimately, the hypervisor is asked to provide memory > and will return failure if it can''t, so the hypervisor is the > final arbiter. > > When a domain is created, we''d like to ensure there is enough memory > for it to "not fail". But when the toolstack asks for memory to > create a domain, it asks for it "piecemeal". I''ll assume that > the toolstack knows how much memory it needs to allocate to ensure > the launch doesn''t fail... my solution is that it asks for that > entire amount of memory at once as a "reservation". If the > hypervisor has that much memory available, it returns success and > must behave as if the memory has been already allocated. Then, > later, when the toolstack is happy that the domain did successfully > launch, it says "remember that reservation? any memory reserved > that has not yet been allocated, need no longer be reserved, you > can unreserve it" > > In other words, between reservation and unreserve, there is no > memory overcommit for that domain. Once the toolstack does > the unreserve, its memory is available for overcommit mechanisms.I think that will be fragile. Suppose you have a 16 GiB domain and an overcommit mechanism that allows you to start the VM with 8 GiB. Straight-forward scenario with xen-4.2 and a combination of PoD and ballooning. 
Suppose you have 14GiB of RAM free in the system. Why should creation of that domain fail? Andres> > Not sure if that part was clear: it''s my intent that unreserve occur > soon after the domain is launched, _not_, for example, when the domain > is shut down. What I don''t know is if there is a suitable point > in the launch when the toolstack knows it can do the "release"... > that may be the sticking point and may be mm-specific. > > Thanks, > Dan
Dan Magenheimer
2012-Oct-07 23:43 UTC
Re: domain creation vs querying free memory (xend and xl)
> > In other words, between reservation and unreserve, there is no > > memory overcommit for that domain. Once the toolstack does > > the unreserve, its memory is available for overcommit mechanisms. > > I think that will be fragile. Suppose you have a 16 GiB domain and an overcommit mechanism that allows > you to start the VM with 8 GiB. Straight-forward scenario with xen-4.2 and a combination of PoD and > ballooning. Suppose you have 14GiB of RAM free in the system. Why should creation of that domain fail?

It shouldn''t. Either I''m not clear or I don''t understand PoD. My understanding of PoD is that, for the above case, the domain has "mem=8192 maxmem=16384". So with my proposal xl would ask for a reservation of 8192M and, when the domain is successfully launched (i.e. for PoD, balloon driver is running?), make the matching unreserve call.

* Not sure why that would be any more fragile than today. In fact it seems to me it is less fragile... changing your example to "8GiB of RAM free in the system", today, xl will ask if there is enough memory and will be told yes and attempt to launch the domain. But then suppose in between the time xl asks the hypervisor if there is enough free memory and the time it attempts to launch the domain, another domain eats up a few pages and now there is ever so slightly less than 8GiB. Won''t the domain creation commence and then fail a few moments later? (A few moments, probably not a big deal, but multiply the memory sizes by 64 and a few moments becomes a few minutes!) With my proposal, the domain will immediately fail to launch because the reservation will fail.

* Maybe the above "there is no memory overcommit for that domain" was confusing? I suppose you could call that "mem=8192 maxmem=16384" overcommit... I just didn''t think of it that way.
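For concreteness, the scenario above corresponds to a guest config along these lines (standard xl syntax); under the proposal, xl would reserve only the memory= amount, not maxmem=:

    # 16 GiB guest started with 8 GiB populated (PoD plus ballooning)
    memory = 8192     # what the reservation (and the initial allocation) covers
    maxmem = 16384    # what the guest may later grow to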
Dan Magenheimer
2012-Oct-08 01:02 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:george.dunlap@eu.citrix.com] > Sent: Friday, October 05, 2012 5:40 AM > To: Dan Magenheimer > Cc: Andres Lagar-Cavilla; Ian Campbell; Tim (Xen.org); Olaf Hering; Keir (Xen.org); Konrad Wilk; Kurt > Hackel; Ian Jackson; xen-devel@lists.xen.org; George Shuklin; Dario Faggioli > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)Hi George -- Thanks for your thoughts!> On 04/10/12 17:54, Dan Magenheimer wrote: > >> > > Scanning through the archived message I am under the impression > > that the focus is on a single server... i.e. "punt if actor is > > not xl", i.e. it addressed "balloon-to-fit" and only tries to avoid > > stepping on other memory overcommit technologies. That makes it > > almost orthogonal, I think, to the problem I originally raised. > No, the idea was to allow the flexibility of different actors in > different situations. The plan was to start with a simple actor, but to > add new ones as necessary. But on reflection, it seems like the whole > "actor" thing was actually something completely separate to what we''re > talking about here. The idea behind the actor (IIRC) was that you could > tell the toolstack, "Make VM A use X amount of host memory"; and the > actor would determine the best way to do that -- either by only > ballooning, or ballooning first and then swapping. But it doesn''t > decide how to get the value X.OK, so if the actor stuff is orthogonal, let''s go back to the original problem. We do want to ensure the solution doesn''t _break_ the actor idea... but IMHO any assumption that there is an actor that can always sufficiently "control" memory allocation is suspect.> This thread has been very hard to follow for some reason, so let me see > if I can understand everything: > * You are concerned about being able to predictably start VMs in the > face of: > - concurrent requests, and > - dynamic memory technologies (including PoD, ballooning, paging, page > sharing, and tmem) > Any of which may change the amount of free memory between the time a > decision is made and the time memory is actually allocated. > * You have proposed a hypervisor-based solution that allows the > toolstack to "reserve" a specific amount of memory to a VM that will not > be used for something else; this allocation is transactional -- it will > either completely succeed, or completely fail, and do it quickly. > > Is that correct?Yes, good summary.> The problem with that solution, it seems to me, is that the hypervisor > does not (and I think probably should not) have any insight into the > policy for allocating or freeing memory as a result of other activities, > such as ballooning or page sharing. Suppose someone were ballooning > down domain M to get 8GiB in order to start domain A; and at some point > , another process looks and says, "Oh look, there''s 4GiB free, that''s > enough to start domain B" and asks Xen to reserve that memory. Xen has > no way of knowing that the memory freed by domain M was "earmarked" for > domain A, and so will happily give it to domain B, causing domain A''s > creation to fail (potentially).I agree completely that the hypervisor shouldn''t have any insight into the _policy_ (though see below). I''m just proposing an extension to the existing mechanism and I am quite convinced that the hypervisor must be involved (e.g. a new hypercall) for the extension to work properly. 
In your example, the "someone" ballooning down domain M to get 8GiB for domain M would need somehow to "reserve" the memory for domain M. I didn''t foresee the use of the proposed reservation mechanism beyond domain creation, but it could probably be used for large ballooning quantities as well.> So it seems like we need to have the idea of a memory controller -- one > central process (per host, as you say) that would know about all of the > knobs -- ballooning, paging, page sharing, tmem, whatever -- that could > be in charge of knowing where all the memory was coming from and where > it was going. So if xl wanted to start a new VM, it can ask the memory > controller for 3GiB, and the controller could decide, "I''ll take 1GiB > from domain M and 2 from domain N, and give it to the new domain", and > respond when it has the memory that it needs. Similarly, it can know > that it should try to keep X megabytes for un-sharing of pages, and it > can be responsible for freeing up more memory if that memory becomes > exhausted.First, let me quibble about the term you used. It''s especially important for you, George, because I know your previous Xen contributions. IMHO, we are not talking about a "memory controller", we are talking about a "memory scheduler". In a CPU scheduler, one would never assume that all demands for CPU time should be reviewed and granted by some userland process in dom0 (and certainly not by some grand central data center manager). That would be silly. Instead, we provide some policy parameters and let each hypervisor make intelligent dynamic decisions thousands of times every second based on those parameters. IMHO, the example you give for asking a memory controller for GiB of memory is equally silly. Outside of some geek with a handful of VMs on a single machine, there is inadequate information from any VM to drive automatic memory allocation decisions and, even if there was, it just doesn''t scale. It doesn''t scale either up, to many VMs across many physical machines, or down, to instantaneous needs of one-page-at-a-time requests for unsharing or for tmem. (Also see my previous comments to Tim about memory-overcommit-by- undercommit: There isn''t sufficient information to size any emergency buffer for unsharing either... too big and you waste memory, too little and it doesn''t solve the underlying problem.)> At the moment, the administrator himself (or the cloud orchestration > layer) needs to be his own memory controller; that is, he needs to > manually decide if there''s enough free memory to start a VM; if there''s > not, he needs to figure out how to get that memory (either by ballooning > or swapping). Ballooning and swapping are both totally under his > control; the only thing he doesn''t control is the unsharing of pages. > But as long as there was a way to tell the page sharing daemon not to > allocate an amount of free memory, then this > "administrator-as-memory-controller" should work just fine. > > Does that make sense? Or am I still confused? :-)It mostly makes sense until you get to host-swapping/unsharing, see comments above. And tmem takes the "doesn''t control" to a whole new level. Meaning tmem (IMHO) completely eliminates the possibility of a "memory controller" and begs for a "memory scheduler". Tmem really is a breakthrough on memory management in a virtualized system. I realize that many people are in the "if it doesn''t work on Windows, I don''t care" camp. 
And others never thought it would make it into upstream Linux (or don''t care because it isn''t completely functional in any distros yet... other than Oracle''s.. but since all parts are now upstream, it will be soon). But there probably are also many that just don''t understand it... I guess I need to work on fixing that. Any thoughts on how to start? In any case, though the reservation proposal is intended to cover tmem as well, I think it is still needed for page-sharing and domain-creation "races". Dan
George Dunlap
2012-Oct-16 11:49 UTC
Re: domain creation vs querying free memory (xend and xl)
On Mon, Oct 8, 2012 at 2:02 AM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:> Tmem really is a breakthrough on memory management in a virtualized > system. I realize that many people are in the "if it doesn''t > work on Windows, I don''t care" camp. And others never thought > it would make it into upstream Linux (or don''t care because it isn''t > completely functional in any distros yet... other than Oracle''s.. > but since all parts are now upstream, it will be soon). But there > probably are also many that just don''t understand it... I guess I need > to work on fixing that. Any thoughts on how to start?

Well, I''m sorry to say this, but to start I think you need to work on your communication. I had read this entire thread 2 or 3 times before writing my last response; and I have now read this e-mail half a dozen times, and I still don''t have a good idea what it is you''re talking about. If I didn''t respect you, I would have just given up on the 2nd try.

In my summary, I mentioned just 2 things: the problem of domain creation, and the solution of a hypercall to allocate a big chunk of memory to a domain. You answered by saying it was a good summary. But then you said:

> I''m just proposing an extension to the > existing mechanism and I am quite convinced that the hypervisor must > be involved (e.g. a new hypercall) for the extension to work properly.

Now you''re talking about an extension... then you mention a "memory scheduler" (which we don''t yet have), and say:

> ...there is inadequate information from > any VM to drive automatic memory allocation decisions and, even if > there was, it just doesn''t scale.

But you don''t say where or who *could* have adequate information; which again hints at something else which you have in mind, but you haven''t actually talked about very explicitly yet. If you have been trying to talk about it, and it wasn''t in my summary, why didn''t you say something about it, instead of saying, "Yes that''s right"? And if you haven''t talked about it, why are you speaking as though we all know already what you''re talking about?

Furthermore, you say things like this:

> IMHO, the example you give for asking a memory controller for GiB > of memory is equally silly. Outside of some geek with a handful > of VMs on a single machine, there is inadequate information from > any VM to drive automatic memory allocation decisions and, even if > there was, it just doesn''t scale. It doesn''t scale either up, to > many VMs across many physical machines, or down, to instantaneous > needs of one-page-at-a-time requests for unsharing or for tmem.

What do you mean, "doesn''t scale up or across"? Why not? Why is there inadequate information inside dom0 for a toolstack-based memory controller? And if there''s not enough information there, who *does* have the information? It''s just a bunch of vague assertions with no justification and no alternative proposed. It doesn''t bring any light to the discussion (which is no doubt why the thread has died without conclusion). Nor does saying "see above" and "see below", when "above" and "below" are still equally unenlightening.

Maybe your grand design for a "memory scheduler", where memory pages hop back and forth at millisecond quanta based on instantaneous data, between page sharing, paging, tmem, and so on, is a good one. But that''s not what we have now. And that''s not even what you''re trying to promote. Instead, you''re trying to push a single hypercall that you think will be necessary for such a scheduler.
Doesn''t it make sense to *first* talk about your grand vision and come up with a reasonable plan for it, *then* propose an implementation? If in the course of your 15-patch series introducing a "memory scheduler", you also introduce a "reservation" hypercall, then everyone can see exactly what it accomplishes, and actually see if it''s necessary, or if some other design would work better. Does that make sense? If I still haven''t understood where you''re coming from, then I am sorry; but I have tried pretty hard, and I''m not the only one having that problem. -George
Dan Magenheimer
2012-Oct-16 17:51 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Mon, Oct 8, 2012 at 2:02 AM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: > > Tmem really is a breakthrough on memory management in a virtualized > > system. I realize that many people are in the "if it doesn''t > > work on Windows, I don''t care" camp. And others never thought > > it would make it into upstream Linux (or don''t care because it isn''t > > completely functional in any distros yet... other than Oracle''s.. > > but since all parts are now upstream, it will be soon). But there > > probably are also many that just don''t understand it... I guess I need > > to work on fixing that. Any thoughts on how to start? > > Well, I''m sorry to say this, but to start I think you need to work on > your communication. I had read this entire thread 2 or 3 times before > writing my last response; and I have now read this e-mail half a dozen > times, and I''m still don''t have a good idea what it is you''re talking > about. If I didn''t respect you, I would have just given up on the 2nd > try. > : > If I still haven''t understood where you''re coming from, then I am > sorry; but I have tried pretty hard, and I''m not the only one having > that problem.Hi George -- Thanks for the honest direct feedback. I had no idea. I have been buried in this memory stuff since April 2008 and it is easy for me to assume that people understand what I am talking about, have read everything I''ve written about it, seen/remember my presentations etc. Further, the conversational delays due to timezone differences and the fact that we all are juggling many different deliverables makes it difficult to maintain all the context necessary to drive/converge a complex discussion. So I am truly sorry and I really appreciate that you''ve stuck with me. Let me ponder how to improve, but try to maintain some forward progress in the interim by continuing this thread. There are two things being mixed here: (A) The very general concepts of how to deal with RAM capacity as a resource and how to best "control" "sharing" of the resource among virtual machines; and (B) how to solve a very specific known problem that occurs due to "races" for memory capacity. Solving (B) requires some assumptions about (A) which is why (A) keeps coming up. I''ll mark my comments below with (A) and (B) to make it clear which is being discussed.> In my summary, I mentioned just 2 things: the problem of domain > creation, and the solution of a hypercall to allocate a big chunk of > memory to a domain. You answered by saying it was a good summary. > But then you said: > > > I''m just proposing an extension to the > > existing mechanism and I am quite convinced that the hypervisor must > > be involved (e.g. a new hypercall) for the extension to work properly. > > Now you''re talking about an extension...This is (B) Extension == new hypercall. (It''s an extension to the way memory has previously been allocated by the hypervisor.)> then you mention a "memory > scheduler" (which we don''t yet have), and say: > > > ...there is inadequate information from > > any VM to drive automatic memory allocation decisions and, even if > > there was, it just doesn''t scale. > > But you don''t say where or who *could* have adequate information; > which again hints at something else which you have in mind, but you > haven''t actually talked about very explicitly yet. 
If you have been > trying to talk about it, and it wasn''t in my summary, why didn''t you > say something about it, instead of saying, "Yes that''s right"? And if > you haven''t talked about it, why are you speaking as though we all > know already what you''re talking about?(A) My bad. The premise of tmem (and IMHO the thorn in the side of all memory capacity management in virtualized systems) is that *nobody* has adequate information. The guest OS has some "demand" information, though not in any externally-communicable form, and the host/hypervisor has "supply" information. Tmem uses a small handful of kernel changes and some hypercalls to tie these together in a surprisingly useful way.> Furthermore, you say things like this: > > > IMHO, the example you give for asking a memory controller for GiB > > of memory is equally silly. Outside of some geek with a handful > > of VMs on a single machine, there is inadequate information from > > any VM to drive automatic memory allocation decisions and, even if > > there was, it just doesn''t scale. It doesn''t scale either up, to > > many VMs across many physical machines, or down, to instantaneous > > needs of one-page-at-a-time requests for unsharing or for tmem. > > What do you mean, "doesn''t scale up or across"? Why not? Why is > there inadequate information inside dom0 for a toolstack-based memory > controller? And if there''s not enough information there, who *does* > have the information? It''s just a bunch of vague assertions with no > justification and no alternative proposed. It doesn''t bring any light > to the discussion (which is no doubt why the thread has died without > conclusion).(A) There is inadequate information period. OS''s have forever been designed to manage a fixed amount of RAM, not to communicate very well about if and when the OS needs more RAM (and how much) or can get by with less RAM (and how much). So any external "memory controller" is (IMHO) doomed to failure, limited to approximations based on pieces of guest-OS-externally-visible usually-out-of-date information collected at a relatively low frequency. Collecting/analyzing/acting-on the information across hundreds/thousands of guests is very difficult (doesn''t "scale up"), collecting/analyzing/acting-on the information across hundreds of machines -- each with hundreds/thousands of guests has exponential communication and bin-packing problems (doesn''t scale "across") and, if the memory-demand is a high-frequency stream of single pages (i.e. with page-unsharing), sampling by the memory controller can''t possibly keep up (doesn''t "scale down"). This is only slightly better than a bunch of vague assertions, but if you disagree, let''s take it down a level in a separate thread. My proposed alternative is tmem. which is why it may appear that I haven''t proposed anything... tmem already exists today.> Nor does saying "see above" and "see below", when "above" and "below" > are still equally unenlightening.Oops, sorry. :-} Just trying to avoid repeating myself.> Maybe your grand designs for a "memory scheduler", where memory pages > hop back and forth at millisecond quanta based on instantaneous data, > between page sharing, paging, tmem, and so on, is a good one. But > that''s not what we have now.(A) Tmem *is* essentially a memory scheduler. A grand design is implemented, works, and all the parts are upstream in open source.> And that''s not even what you''re trying > to promote. 
Instead, you''re trying to push a single hypercall that > you think will be necessary for such a scheduler.(B) Strangely, tmem doesn''t really need this hypercall. It already has a solution working in xm create called "tmem freeze/thaw". But this solution is a half-assed very heavy hammer. The single "memory reservation" hypercall is intended to help solve a known problem (IanJ said early in this thread: "This is a real problem") with any environment where the amount of RAM used by a guest can change dynamically without the knowledge of a not-in-hypervisor "memory controller", and the toolstack then wishes to launch a new domain. The problem can even occur with multiple toolstack threads simultaneously launching domains. After further thought, it appeared that the "memory reservation" hypercall also eliminates the need for the half-assed tmem freeze/thaw as well.> Doesn''t it make sense to *first* talk about your grand vision and come > up with a reasonable plan for it, *then* propose an implementation? > If in the course of your 15-patch series introducing a "memory > scheduler", you also introduce a "reservation" hypercall, then > everyone can see exactly what it accomplishes, and actually see if > it''s necessary, or if some other design would work better. > > Does that make sense?If you reread my last response with the assumption in mind: "tmem == an instance of a memory scheduler == grand vision" then does the discussion of the "memory reservation" hypercall make more sense? Thanks again for the pointed communication feedback. Hopefully this is a bit better and I will continue to ponder more communication improvements. Dan
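To make the race and the proposed fix concrete, here is a minimal sketch, in plain C, of the check-then-allocate pattern versus the claim-up-front transaction being argued for. Nothing below is an existing Xen hypercall: claim_for_domain() and the sizes are hypothetical, and only the accounting is modelled.

/*
 * Minimal sketch of "reserve, then allocate" semantics, modelled purely
 * as userspace accounting.  claim_for_domain()/convert_claim_to_allocation()
 * are hypothetical names, not real Xen or libxl interfaces.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static uint64_t free_kib = 8u * 1024 * 1024;     /* host free memory (example) */
static uint64_t claimed_kib;                     /* reserved but not yet allocated */

/* Racy pattern: check free memory now, allocate the pages later. */
static bool racy_can_start(uint64_t maxmem_kib)
{
    return free_kib >= maxmem_kib;               /* may be stale by allocation time */
}

/* Proposed pattern: atomically claim the whole maxmem up front. */
static bool claim_for_domain(uint64_t maxmem_kib)
{
    bool ok = false;
    pthread_mutex_lock(&lock);
    if (free_kib - claimed_kib >= maxmem_kib) {
        claimed_kib += maxmem_kib;               /* the claim is the "transaction" */
        ok = true;
    }
    pthread_mutex_unlock(&lock);
    return ok;
}

/* Called as the domain builder actually populates memory. */
static void convert_claim_to_allocation(uint64_t kib)
{
    pthread_mutex_lock(&lock);
    claimed_kib -= kib;
    free_kib -= kib;
    pthread_mutex_unlock(&lock);
}

int main(void)
{
    uint64_t dom_kib = 6u * 1024 * 1024;         /* 6 GiB guest */

    printf("racy check says: %s\n", racy_can_start(dom_kib) ? "yes" : "no");
    printf("claim 1: %s\n", claim_for_domain(dom_kib) ? "granted" : "refused");
    printf("claim 2: %s\n", claim_for_domain(dom_kib) ? "granted" : "refused");
    convert_claim_to_allocation(dom_kib);        /* builder finishes domain 1 */
    return 0;
}

The point of the sketch is only that the check and the reservation must happen under one lock (or one hypercall), so two concurrent domain builds cannot both see the same memory as free.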
George Dunlap
2012-Oct-17 17:35 UTC
Re: domain creation vs querying free memory (xend and xl)
[Sorry, forgot to reply-to-all] On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:> > If you reread my last response with the assumption in mind: > > "tmem == an instance of a memory scheduler == grand vision" > > then does the discussion of the "memory reservation" hypercall > make more sense? Sort of. :-) Unfortunately, I think it shows a bit of confusion, which is perhaps why it was hard to understand. But let''s go back for a minute to the problem at hand: you''re afraid of free memory disappearing between a toolstack checking for the memory, and the toolstack actually creating the VM. There are two ways this could happen: 1. Another admin command (perhaps by another administrator) has caused the memory to go away -- i.e., another admin has called "xl create", or has instructed a VM to balloon up to a higher amount of memory. 2. One of the self-directed processes in the system has allocated the memory: a balloon driver has ballooned up, or the swapper has swapped something in, or the page sharing daemon has had to un-share pages. In the case of #1, I think the right answer to that is, "Don''t do that." :-) The admins should co-ordinate with each other about what to start where; if they both want to use a bit of memory, that''s a human interaction problem, not a technological one. Alternately, if we''re talking a cloud orchestration layer, the cloud orchestration should have an idea how much memory is available on each node, and not allow different users to issue commands which would violate those. In the case of #2, I think the answer is, "self-directed processes should not be allowed to consume free memory without permission from the toolstack". The pager should not increase the memory footprint of a VM unless either told to by an admin or a memory controller which has been given authority by an admin. (Yes, memory controller, not scheduler -- more on that in another e-mail.) A VM should be given a fixed amount of memory above which the balloon driver cannot go. The page-sharing daemon should have a small amount set aside to handle un-sharing requests; but this should be immediately replenished by other methods (preferably by ballooning a VM down, or if necessary by swapping pages out). It should not be able to make arbitrarily large allocations without permission from the toolstack. I was chatting with Konrad yesterday, and he brought up "self-ballooning" VMs, which apparently voluntarily choose to balloon down to *below* their toolstack-dictated balloon target, in order to induce Linux to swap some pages out to tmem, and will then balloon up to the toolstack-dictated target later. It seems to me that the Right Thing in this case is for the toolstack to know that this "free" memory isn''t really free -- that if your 2GiB VM is only using 1.5GiB, you nonetheless don''t touch that 0.5GiB, because you know it may use it later. This is what xapi does. Alternately, if you don''t want to do that accounting, and just want to use Xen''s free memory to determine if you can start a VM, then you could just have your "self-ballooning" processes *not actually free the memory*. That way the free memory would be an accurate representation of how much memory is actually present on a system. In all of this discussion, I don''t see any reason to bring up tmem at all (except to note the reason why a VM may balloon down). It''s just another area to which memory can be allocated (along with Xen or a domain). 
It also should not be allowed to allocate free Xen memory to itself without being specifically instructed by the toolstack, so it can''t cause the problem you''re talking about. Any system that follows the rules I''ve set above won''t have to worry about free memory disappearing half-way through domain creation. I''m not fundamentally opposed to the idea of an "allocate memory to a VM" hypercall; but the arguments adduced to support this seem hopelessly confused, which does not bode well for the usefulness or maintainability of such a hypercall. -George
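As a concrete illustration of the rule above, a minimal sketch of toolstack-side accounting, assuming the libxc call xc_physinfo() for the host view; the reserved_kib ledger is hypothetical toolstack state rather than anything xl or xapi actually keeps, and error handling is pared down.

/*
 * Sketch of the accounting rule described above: a guest that has ballooned
 * below its static maximum still "owns" the difference, so the toolstack
 * subtracts every promised static-max from host memory before deciding
 * whether a new guest fits, instead of trusting the instantaneous
 * free_pages figure.
 */
#include <stdio.h>
#include <xenctrl.h>

/* Sum of static-max for every existing guest plus pending claims (KiB).
 * In a real toolstack this would be derived from domain config records;
 * the value here is an example only. */
static long long reserved_kib = 5LL * 1024 * 1024;

static long long usable_kib_after_reservations(void)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    xc_physinfo_t info;
    long long host_kib;

    if (!xch)
        return -1;
    if (xc_physinfo(xch, &info)) {
        xc_interface_close(xch);
        return -1;
    }
    xc_interface_close(xch);

    /* Host total in KiB; 4 KiB pages assumed (x86). */
    host_kib = (long long)info.total_pages * 4;

    /* What matters is the total minus everything already promised, not the
     * instantaneous free memory, which existing guests may grow back into. */
    return host_kib > reserved_kib ? host_kib - reserved_kib : 0;
}

int main(void)
{
    long long usable = usable_kib_after_reservations();
    long long want = 2LL * 1024 * 1024;          /* 2 GiB guest */

    if (usable < 0)
        printf("query failed\n");
    else
        printf("%s\n", usable >= want ? "fits" : "does not fit");
    return 0;
}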
George Dunlap
2012-Oct-17 17:35 UTC
Re: domain creation vs querying free memory (xend and xl)
On Wed, Oct 17, 2012 at 6:30 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:> A VM should be given a > fixed amount of memory above which the balloon driver cannot go. I forgot to mention: there is a limit you can set in the hypervisor such that the balloon driver cannot go up past a certain point. And since 4.1, I think, it has been possible to set this limit to below what the VM currently has allocated -- the effect being that as soon as the VM balloons down to that point, it cannot balloon back up. Xapi sets this value at the same time it sets the balloon target in xenstore, so that it can have confidence that once it actually has some free memory, it won''t disappear from under its feet.
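A sketch of that sequence as a toolstack might issue it, assuming the libxc call xc_domain_setmaxmem() and the libxenstore call xs_write(); the domid, target, and slack values are illustrative only.

/*
 * Sketch of the xapi-style sequence: write the balloon target into
 * xenstore and, at the same time, clamp the hypervisor-side maximum so
 * the guest cannot balloon back up once it has come down.
 */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <xenctrl.h>
#include <xenstore.h>

static int set_target_and_clamp(uint32_t domid, uint64_t target_kib)
{
    xc_interface *xch = xc_interface_open(NULL, NULL, 0);
    struct xs_handle *xsh = xs_open(0);
    char path[64], val[32];
    int rc = -1;

    if (!xch || !xsh)
        goto out;

    /* 1. Tell the balloon driver where to go (KiB, decimal string). */
    snprintf(path, sizeof(path), "/local/domain/%u/memory/target", domid);
    snprintf(val, sizeof(val), "%" PRIu64, target_kib);
    if (!xs_write(xsh, XBT_NULL, path, val, strlen(val)))
        goto out;

    /* 2. Clamp the hypervisor maximum to the same point, plus a little
     * slack for guest overheads (the slack here is made up). */
    if (xc_domain_setmaxmem(xch, domid, target_kib + 1024))
        goto out;

    rc = 0;
out:
    if (xsh) xs_close(xsh);
    if (xch) xc_interface_close(xch);
    return rc;
}

int main(void)
{
    return set_target_and_clamp(1, 1024 * 1024) ? 1 : 0;   /* dom 1 -> 1 GiB */
}

Whether max_pages (or a future "current_allowance" field) is the right knob for this is exactly the point debated below.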
Andres Lagar-Cavilla
2012-Oct-17 18:33 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 17, 2012, at 1:35 PM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:> [Sorry, forgot to reply-to-all] > > On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: >> >> If you reread my last response with the assumption in mind: >> >> "tmem == an instance of a memory scheduler == grand vision" >> >> then does the discussion of the "memory reservation" hypercall >> make more sense? > > Sort of. :-) Unfortunately, I think it shows a bit of confusion, which > is perhaps why it was hard to understand. > > But let''s go back for a minute to the problem at hand: you''re afraid > of free memory disappearing between a toolstack checking for the > memory, and the toolstack actually creating the VM. > > There are two ways this could happen: > > 1. Another admin command (perhaps by another administrator) has caused > the memory to go away -- i.e,. another admin has called "xl create", > or has instructed a VM to balloon up to a higher amount of memory. > > 2. One of the self-directed processes in the system has allocated the > memory: a balloon driver has ballooned up, or the swapper has swapped > something in, or the page sharing daemon has had to un-share pages. > > In the case of #1, I think the right answer to that is, "Don''t do > that." :-) The admins should co-ordinate with each other about what > to start where; if they both want to use a bit of memory, that''s a > human interaction problem, not a technological one. Alternately, if > we''re talking a cloud orchestration layer, the cloud orchestration > should have an idea how much memory is available on each node, and not > allow different users to issue commands which would violate those. > > In the case of #2, I think the answer is, "self-directed processes > should not be allowed to consume free memory without permission from > the toolstack". The pager should not increase the memory footprint of > a VM unless either told to by an admin or a memory controller which > has been given authority by an admin. (Yes, memory controller, not > scheduler -- more on that in another e-mail.) A VM should be given a > fixed amount of memory above which the balloon driver cannot go. The > page-sharing daemon should have a small amount set aside to handle > un-sharing requests; but this should be immediately replenished by > other methods (preferably by ballooning a VM down, or if necessary by > swapping pages out). It should not be able to make arbitrarily large > allocations without permission from the toolstack.Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack manages. Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If anything we should change the hypervisor to have a "current_allowance" or similar field with a more obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all known methods can be bounded in terms of their capacity to allocate memory unsupervised. Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the heap. The toolstack ensures there is something set apart. 
I further think the pod cache could be converted to this model. Why have specific per-domain lists of cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled requirement of certain toolstack features. But allows to throw away a lot of complex code. My two cents for the new iteration Andres> > I was chatting with Konrad yesterday, and he brought up > "self-ballooning" VMs, which apparently vonluntarily choose to balloon > down to *below* their toolstack-dictated balloon target, in order to > induce Linux to swap some pages out to tmem, and will then balloon up to > the toolstack-dictated target later. > > It seems to me that the Right Thing in this case is for the toolstack > to know that this "free" memory isn''t really free -- that if your 2GiB > VM is only using 1.5GiB, you nonetheless don''t touch that 0.5GiB, > because you know it may use it later. This is what xapi does. > > Alternately, if you don''t want to do that accounting, and just want to > use Xen''s free memory to determine if you can start a VM, then you > could just have your "self-ballooning" processes *not actually free > the memory*. That way the free memory would be an accurate > representation of how much memory is actually present on a system. > > In all of this discussion, I don''t see any reason to bring up tmem at > all (except to note the reason why a VM may balloon down). It''s just > another area to which memory can be allocated (along with Xen or a > domain). It also should not be allowed to allocate free Xen memory to > itself without being specifically instructed by the toolstack, so it can''t > cause the problem you''re talking about. > > Any system that follows the rules I''ve set above won''t have to worry > about free memory disappearing half-way through domain creation. > > I''m not fundamentally opposed to the idea of an "allocate memory to a > VM" hypercall; but the arguments adduced to support this seem > hopelessly confused, which does not bode well for the usefulness or > maintainability of such a hypercall. > > -George
Dan Magenheimer
2012-Oct-17 18:45 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: George Dunlap [mailto:George.Dunlap@eu.citrix.com] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > On Tue, Oct 16, 2012 at 6:51 PM, Dan Magenheimer > <dan.magenheimer@oracle.com> wrote: > > > > If you reread my last response with the assumption in mind: > > > > "tmem == an instance of a memory scheduler == grand vision" > > > > then does the discussion of the "memory reservation" hypercall > > make more sense? > > Sort of. :-) Unfortunately, I think it shows a bit of confusion, which > is perhaps why it was hard to understand. > : > I''m not fundamentally opposed to the idea of an "allocate memory to a > VM" hypercall; but the arguments adduced to support this seem > hopelessly confused, which does not bode well for the usefulness or > maintainability of such a hypercall.Hi George -- Now I think I have a better idea as to why you are not understanding and why you think this is confusing!!! It seems we are not only speaking different languages but are from completely different planets! I.e. our world views are very very different. You have a very very static/partitioned/restrictive/controlled view of how memory should be managed in a virtual environment. I have a very very dynamic view of how memory should be managed in a virtual environment. Tmem -- and the ability to change guest kernels to cooperate in dynamic memory management -- very obviously drives my world view, but my view reveals subtle deficiencies in your world view. Xapi and the constraints it lives under (i.e. requirement for proprietary HVM guest kernels) and the existing Xapi memory controller model seems good enough for you, so your view makes my need for handling subtle dynamic corner cases appear that I must have some secret fantastical "grand design" in mind.> Any system that follows the rules I''ve set above won''t have to worry > about free memory disappearing half-way through domain creation.Agreed. My claim is that: (1) tmem can''t possibly follow your rules as it would decrease its value/performance by several orders of magnitude; (2) page-unsharing/swapping can''t possibly follow your rules because the corner cases it must deal with are urgent, frequent, and unpredictable; (3) a "cloud orchestration layer" can''t follow your rules because of complexity and communication limits, unless it greatly constrains its flexibility/automation; (4) following your rules serializes common administration activities even for Xapi that otherwise don''t need to be serialized. I think your rules take an overconstrained problem (managing memory for multiple VMs) and add more constraints. While IMHO tmem takes away constraints. That''s why I brought up CPU schedulers. I know you are an expert in CPU scheduling, and you would never apply similar rules to CPU scheduling that you want to apply to "memory scheduling". E.g. you would never require the toolstack to be in the critical path for every VCPU->CPU reassignment. And so I have to try to solve a problem that you don''t have (or IMHO that you will likely have in the future but don''t admit to yet ;-) And I think the "reservation" hypercall will solve that problem.> In all of this discussion, I don''t see any reason to bring up tmem at > all (except to note the reason why a VM may balloon down). It''s just > another area to which memory can be allocated (along with Xen or a > domain). 
It also should not be allowed to allocate free Xen memory to > itself without being specifically instructed by the toolstack, so it can''t > cause the problem you''re talking about.This is all very wrong. It''s clear you don''t understand why tmem exists, how it works, and what its value is/can be in the cloud. I''ll take some of the blame for that because I''ve had to spend so much time in Linux-kernel land in the last couple of years. But if you want to try a different world view, and understand tmem, let me know ;-) I don''t mean to be immodest, but I truly believe it is the first significant advance in managing RAM in a virtual environment in ten years (since Waldspurger). Dan
Dan Magenheimer
2012-Oct-17 19:46 UTC
Re: domain creation vs querying free memory (xend and xl)
> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] > Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl)Hi Andres -- Re reply just sent to George... I think you must be on a third planet, revolving somewhere between George''s and mine. I say that because I agree completely with some of your statements and disagree with the conclusions you draw from them! :-)> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If > anything we should change the hypervisor to have a "current_allowance" or similar field with a more > obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness.Me cringes too. Though I can see from George''s view that it makes perfect sense. Since the toolstack always controls exactly how much memory is assigned to a domain and since it can cache the "original max", current allowance and the hypervisors view of max_pages must always be the same. Only if the hypervisor or the domain or the domain''s administrator can tweak current memory usage without the knowledge of the toolstack (which is closer to my planet) does an issue arise. And, to me, that''s the foundation of this whole thread.> Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all > known methods can be bounded in terms of their capacity to allocate memory unsupervised. > Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the > heap. The toolstack ensures there is something set apart.By "ex machina" do you mean "without the toolstack''s knowledge"? Then how does page-unsharing work? Does every page-unshare done by the hypervisor require serial notification/permission of the toolstack? Or is this "batched", in which case a pool is necessary, isn''t it? (Not sure what you mean by "no need for a pool" and then "toolstack ensures there is something set apart"... what''s the difference?) My point is, whether there is no pool or a pool that sometimes runs dry, are you really going to put the toolstack in the hypervisor''s path for allocating a page so that the hypervisor can allocate a new page for CoW to fulfill an unshare?> Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect > of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely > have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack > manages.In a perfect world where the toolstack has an oracle for the precise time-varying memory requirements for all guests, I would agree. In that world, there''s no need for a CPU scheduler either... the toolstack can decide exactly when to assign each VCPU for each VM onto each PCPU, and when to stop and reassign. And then every PCPU would be maximally utilized, right? My point: Why would you resource-manage CPUs differently from memory? The demand of real-world workloads varies dramatically for both... don''t you want both to be managed dynamically, whenever possible? If yes (dynamic is good), in order for the toolstack''s view of memory to be non-racy, doesn''t every hypervisor page allocation need to be serialized with the toolstack granting notification/permission?> I further think the pod cache could be converted to this model. Why have specific per-domain lists of > cached pages in the hypervisor? Get them back from the heap! 
Obviously places a decoupled requirement > of certain toolstack features. But allows to throw away a lot of complex code.IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") the heap is "always" empty because the toolstack has assigned all memory. So I''m still confused... where does "page unshare" get memory from and how does it notify and/or get permission from the toolstack?> My two cents for the new iterationI''ll see your two cents, and raise you a penny! ;-) Dan
Andres Lagar-Cavilla
2012-Oct-17 20:14 UTC
Re: domain creation vs querying free memory (xend and xl)
On Oct 17, 2012, at 3:46 PM, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:>> From: Andres Lagar-Cavilla [mailto:andreslc@gridcentric.ca] >> Subject: Re: [Xen-devel] domain creation vs querying free memory (xend and xl) > > Hi Andres -- > > Re reply just sent to George... > > I think you must be on a third planet, revolving somewhere between > George''s and mine. I say that because I agree completely with some > of your statements and disagree with the conclusions you draw from > them! :-) > >> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If >> anything we should change the hypervisor to have a "current_allowance" or similar field with a more >> obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. > > Me cringes too. Though I can see from George''s view that it makes > perfect sense. Since the toolstack always controls exactly how > much memory is assigned to a domain and since it can cache the > "original max", current allowance and the hypervisors view of > max_pages must always be the same.No. There is room for slack. max_pages (or current_allowance) simply sets an upper bound, which if met will trigger the need for memory management intervention.> > Only if the hypervisor or the domain or the domain''s administrator > can tweak current memory usage without the knowledge of the > toolstack (which is closer to my planet) does an issue arise. > And, to me, that''s the foundation of this whole thread. > >> Once you guarantee no "ex machina" entities fudging the view of the memory the toolstack has, then all >> known methods can be bounded in terms of their capacity to allocate memory unsupervised. >> Note that this implies as well, I don''t see the need for a pool of "unshare" pages. It''s all in the >> heap. The toolstack ensures there is something set apart. > > By "ex machina" do you mean "without the toolstack''s knowledge"? > > Then how does page-unsharing work? Does every page-unshare done by > the hypervisor require serial notification/permission of the toolstack?No of course not. But if you want to keep a domain at bay you keep its max_pages where you want it to stop growing. And at that point the domain will fall asleep (not 100% there hypervisor-wise yet but Real Soon Now (™)), and a synchronous notification will be sent to a listener. At that point it''s again a memory management decision. Should I increase the domain''s reservation, page something out, etc? There is a range of possibilities that are not germane to the core issue of enforcing memory limits.> Or is this "batched", in which case a pool is necessary, isn''t it? > (Not sure what you mean by "no need for a pool" and then "toolstack > ensures there is something set apart"... what''s the difference?)I am under the impression there is a proposal floating for a hypervisor-maintained pool of pages to immediately relief un-sharing. Much like there is now for PoD (the pod cache). This is what I think is not necessary.> > My point is, whether there is no pool or a pool that sometimes > runs dry, are you really going to put the toolstack in the hypervisor''s > path for allocating a page so that the hypervisor can allocate > a new page for CoW to fulfill an unshare?Absolutely not.> >> Something that I struggle with here is the notion that we need to extend the hypervisor for any aspect >> of the discussion we''ve had so far. I just don''t see that. 
The toolstack has (or should definitely >> have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack >> manages. > > In a perfect world where the toolstack has an oracle for the > precise time-varying memory requirements for all guests, I > would agree.With the mechanism outlined, the toolstack needs to make coarse-grained infrequent decisions. There is a possibility for pathological misbehavior -- I think there is always that possibility. Correctness is preserved, at worst, performance will be hurt. It''s really important to keep things separate in this discussion. The toolstack+hypervisor are enabling (1) control over how memory is allocated to what (2) control over a domain''s ability to grow its footprint unsupervised (3) control over a domain''s footprint with PV mechanisms from within, or externally. Performance is not up to the toolstack but to the memory manager magic the toolstack enables with (3).> > In that world, there''s no need for a CPU scheduler either... > the toolstack can decide exactly when to assign each VCPU for > each VM onto each PCPU, and when to stop and reassign. > And then every PCPU would be maximally utilized, right? > > My point: Why would you resource-manage CPUs differently from > memory? The demand of real-world workloads varies dramatically > for both... don''t you want both to be managed dynamically, > whenever possible? > > If yes (dynamic is good), in order for the toolstack''s view of > memory to be non-racy, doesn''t every hypervisor page allocation > need to be serialized with the toolstack granting notification/permission?Once you bucketize RAM and know you will get synchronous kicks as buckets fill up, then you have a non-racy view. If you choose buckets of width one…..> >> I further think the pod cache could be converted to this model. Why have specific per-domain lists of >> cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled requirement >> of certain toolstack features. But allows to throw away a lot of complex code. > > IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") > the heap is "always" empty because the toolstack has assigned all memory.I don''t think that''s what they mean. Nor is it what I mean. The toolstack may chunk memory up into abstract buckets. It can certainly assert that its bucketized view matches the hypervisor view. Pages flow from the heap to each domain -- but the bucket "domain X" will not overflow unsupervised.> So I''m still confused... where does "page unshare" get memory from > and how does it notify and/or get permission from the toolstack?Re sharing, as it should be clear by now, the answer is "it doesn''t matter". If unsharing cannot be satisfied form the heap, then memory management in dom0 is invoked. Heavy-weight, but it means you''ve hit an admin-imposed limit. Please note that this notion of limits and enforcement is sparingly applied today, to the best of my knowledge. But imho it''d be great to meaningfully work towards it. Andres> >> My two cents for the new iteration > > I''ll see your two cents, and raise you a penny! ;-) > > Dan
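A sketch of the listener model described above: the event source below is a stub standing in for whatever synchronous notification mechanism the hypervisor provides when a domain hits its cap, the policy numbers are invented, and the real actions (raising max_pages, ballooning another guest down, paging) are reduced to printf placeholders.

/*
 * Sketch of "bucketized" growth: a guest that bumps into its cap blocks
 * until a supervisor process decides whether to grant another bucket from
 * the heap or reclaim memory elsewhere.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct cap_event { uint32_t domid; uint64_t cur_kib; uint64_t cap_kib; };

static uint64_t host_free_kib = 512 * 1024;      /* example: 512 MiB spare */

/* Stub: a real toolstack would block on the hypervisor notification
 * channel until some domain hits its cap. */
static bool next_cap_event(struct cap_event *ev)
{
    static int fired;
    if (fired++) return false;
    *ev = (struct cap_event){ .domid = 3, .cur_kib = 2048 * 1024,
                              .cap_kib = 2048 * 1024 };
    return true;
}

static void raise_cap(uint32_t domid, uint64_t new_cap_kib)
{   /* would call xc_domain_setmaxmem() or equivalent here */
    printf("dom%u: cap raised to %llu KiB\n", domid,
           (unsigned long long)new_cap_kib);
}

static void reclaim_elsewhere(uint64_t kib)
{   /* would balloon another guest down, or start paging/swapping */
    printf("reclaiming %llu KiB from other guests\n", (unsigned long long)kib);
}

int main(void)
{
    struct cap_event ev;
    const uint64_t step_kib = 256 * 1024;        /* grow in 256 MiB buckets */

    while (next_cap_event(&ev)) {
        if (host_free_kib >= step_kib) {
            host_free_kib -= step_kib;           /* grant from the heap */
            raise_cap(ev.domid, ev.cap_kib + step_kib);
        } else {
            reclaim_elsewhere(step_kib);         /* heavy-weight fallback */
        }
    }
    return 0;
}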
Dan Magenheimer
2012-Oct-17 22:07 UTC
Re: domain creation vs querying free memory (xend and xl)
Hi Andres -- First, the primary target of page-sharing is HVM proprietary/legacy guests, correct? So, as I said, we are starting from different planets. I''m not arguing that a toolstack-memory-controller won''t be sufficient for your needs, especially in a single server environment, only that the work required to properly ensure that:> >> The toolstack has (or should definitely have) a non-racy view > >> of the memory of the hostis unnecessary if you (and the toolstack) take a slightly broader dynamic view of memory management. IMHO that broader view (which requires the "memory reservation" hypercall) both encompasses tmem and IMHO greatly simplifies memory management in the presence of page-unsharing. I.e. it allows the toolstack to NOT have a non-racy view of the memory of the host. So, if you don''t mind, I will take this opportunity to ask some questions about page-sharing stuff, in the context of the toolstack-memory-controller and/or memory reservation hypercall.> >> Domains can be cajoled into obedience via the max_pages tweak -- which I profoundly dislike. If > >> anything we should change the hypervisor to have a "current_allowance" or similar field with a more > >> obvious meaning. The abuse of max_pages makes me cringe. Not to say I disagree with its usefulness. > > > > Me cringes too. Though I can see from George''s view that it makes > > perfect sense. Since the toolstack always controls exactly how > > much memory is assigned to a domain and since it can cache the > > "original max", current allowance and the hypervisors view of > > max_pages must always be the same. > > No. There is room for slack. max_pages (or current_allowance) simply sets an upper bound, which if met > will trigger the need for memory management intervention.I think we agree if we change my "must always be the same" to "must always be essentially the same, ignoring some fudge factor". Which begs the questions: How does one determine how big the fudge factor is, what happens if it is not big enough, and if it is too big, doesn''t that potentially add up to a lot of wasted space?> > By "ex machina" do you mean "without the toolstack''s knowledge"? > > > > Then how does page-unsharing work? Does every page-unshare done by > > the hypervisor require serial notification/permission of the toolstack? > > No of course not. But if you want to keep a domain at bay you keep its max_pages where you want it to > stop growing. And at that point the domain will fall asleep (not 100% there hypervisor-wise yet but > Real Soon Now (T)), and a synchronous notification will be sent to a listener. > > At that point it''s again a memory management decision. Should I increase the domain''s reservation, > page something out, etc? There is a range of possibilities that are not germane to the core issue of > enforcing memory limits.Maybe we need to dive deep into page-sharing accounting for a moment here: When a page is shared say, by 1000 different VMs, does it get "billed" to all VMs? If no (which makes the most sense to me), how is the toolstack informed that there is now 999 free pages available so that it can use them in, say, a new domain? Does the hypervisor notification wait until there is sufficient pages (say, a bucket''s worth)? If yes, what''s the point of sharing if the hypervisor now has some free memory but the the freed memory is still "billed"; and are there data structures in the hypervisor to track this so that unsharing does proper accounting too? 
Now suppose 10000 pages are shared by 1000 different VMs at domain launch (scenario: an online class is being set up by a cloud user) and then the VMs suddenly get very active and require a lot of CoWing (say the online class just got underway). What''s the profile of interaction between the hypervisor and toolstack? Maybe you''ve got this all figured out (whether implemented or not) and are convinced it is scalable (or don''t care because the target product is a small single system), but I''d imagine the internal hypervisor vs toolstack accounting/notifications will get very very messy and have concerns about scalability and memory waste.> > Or is this "batched", in which case a pool is necessary, isn''t it? > > (Not sure what you mean by "no need for a pool" and then "toolstack > > ensures there is something set apart"... what''s the difference?) > > I am under the impression there is a proposal floating for a hypervisor-maintained pool of pages to > immediately relief un-sharing. Much like there is now for PoD (the pod cache). This is what I think is > not necessary.I agree it is not necessary, but don''t understand who manages the "slop" (unallocated free pages) and how a pool is different from a "bucket" (to use your term from further down in your reply).> > My point is, whether there is no pool or a pool that sometimes > > runs dry, are you really going to put the toolstack in the hypervisor''s > > path for allocating a page so that the hypervisor can allocate > > a new page for CoW to fulfill an unshare? > > Absolutely not.Good to hear. But this begs answers to the previous questions. Mainly: How does it all work then so that the toolstack and hypervisor are "in sync" about the number of available pages such that the toolstack never wrongly determines that there is enough free space to launch a domain and (by the time it tries to use the free space) there really isn''t? If they can''t remain in sync (at least within a single "bucket", across the entire system, not one bucket per domain), then isn''t something like the proposed "memory reservation" hypercall still required?> >> Something that I struggle with here is the notion that we need to extend the hypervisor for any > aspect > >> of the discussion we''ve had so far. I just don''t see that. The toolstack has (or should definitely > >> have) a non-racy view of the memory of the host. Reservations are therefore notions the toolstack > >> manages. > > > > In a perfect world where the toolstack has an oracle for the > > precise time-varying memory requirements for all guests, I > > would agree. > > With the mechanism outlined, the toolstack needs to make coarse-grained infrequent decisions. There is > a possibility for pathological misbehavior -- I think there is always that possibility. Correctness is > preserved, at worst, performance will be hurt.IMHO, performance will be hurt not only for the pathological cases. Memory will also needlessly be wasted. But, for Windows, I don''t have a better solution, and it will probably be no worse than Microsoft''s solution.> It''s really important to keep things separate in this discussion. The toolstack+hypervisor are > enabling (1) control over how memory is allocated to what (2) control over a domain''s ability to grow > its footprint unsupervised (3) control over a domain''s footprint with PV mechanisms from within, or > externally. 
> > Performance is not up to the toolstack but to the memory manager magic the toolstack enables with (3).Good dichotomy (though not entirely perfect on my planet).> > In that world, there''s no need for a CPU scheduler either... > > the toolstack can decide exactly when to assign each VCPU for > > each VM onto each PCPU, and when to stop and reassign. > > And then every PCPU would be maximally utilized, right? > > > > My point: Why would you resource-manage CPUs differently from > > memory? The demand of real-world workloads varies dramatically > > for both... don''t you want both to be managed dynamically, > > whenever possible? > > > > If yes (dynamic is good), in order for the toolstack''s view of > > memory to be non-racy, doesn''t every hypervisor page allocation > > need to be serialized with the toolstack granting notification/permission? > > Once you bucketize RAM and know you will get synchronous kicks as buckets fill up, then you have a > non-racy view. If you choose buckets of width one...... e.g. tmem, which is saving one page of data at high frequency> >> I further think the pod cache could be converted to this model. Why have specific per-domain lists > of > >> cached pages in the hypervisor? Get them back from the heap! Obviously places a decoupled > requirement > >> of certain toolstack features. But allows to throw away a lot of complex code. > > > > IIUC in George''s (Xapi) model (or using Tim''s phrase, "balloon-to-fit") > > the heap is "always" empty because the toolstack has assigned all memory. > > I don''t think that''s what they mean. Nor is it what I mean. The toolstack may chunk memory up into > abstract buckets. It can certainly assert that its bucketized view matches the hypervisor view. Pages > flow from the heap to each domain -- but the bucket "domain X" will not overflow unsupervised.Right, but it is the "underflow" I am concerned with. I don''t know if that is what they mean by "balloon-to-fit" (or exactly what you mean), but I think we are all trying to optimize the use of a fixed amount of RAM among some number of VMs. To me, a corollary of that is that the size of the heap is always as small "as possible". And another corollary is that there aren''t a bunch of empty pools of free pages lying about waiting for rare events to happen. And one more corollary is that, to the extent possible, guests aren''t "wasting" memory.> > So I''m still confused... where does "page unshare" get memory from > > and how does it notify and/or get permission from the toolstack? > > Re sharing, as it should be clear by now, the answer is "it doesn''t matter". If unsharing cannot be > satisfied form the heap, then memory management in dom0 is invoked. Heavy-weight, but it means you''ve > hit an admin-imposed limit.Well it *does* matter if that fallback (unsharing cannot be satisfied from the heap) happens too frequently.> Please note that this notion of limits and enforcement is sparingly applied today, to the best of my > knowledge. But imho it''d be great to meaningfully work towards it.Agreed. There''s lots of policy questions around all of our different mechanism "planets", so I hope this discussion meaningfully helps! Thanks for the great discussion! Dan
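For scale, the arithmetic behind the sharing scenario posed above (10,000 pages shared by 1,000 VMs, 4 KiB pages assumed):

/* Back-of-the-envelope numbers for the hypothetical scenario above. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const uint64_t page_kib = 4, shared_pages = 10000, guests = 1000;

    uint64_t mapped   = shared_pages * guests;         /* page mappings */
    uint64_t physical = shared_pages;                  /* backing pages */
    uint64_t saved    = mapped - physical;             /* freed by sharing */

    printf("mappings: %llu, backing pages: %llu\n",
           (unsigned long long)mapped, (unsigned long long)physical);
    printf("nominally freed: %llu pages = %.1f GiB\n",
           (unsigned long long)saved,
           (double)saved * page_kib / (1024.0 * 1024.0));
    /* Worst case, CoW faults can demand all of that back; the open
     * question is whose budget those pages come out of, and when the
     * toolstack finds out. */
    return 0;
}

Roughly 38 GiB is nominally freed in this example, and in the worst case un-sharing can demand all of it back, which is exactly the accounting question left open in the thread.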