Hi,

Anthony Xu and I have had some fruitful discussion about the further
direction of the NUMA support in Xen. I wanted to share the results with
the Xen community and start a discussion.

We came up with two different approaches for better NUMA support in Xen:

1.) Guest NUMA support: spread a guest's resources (CPUs and memory)
over several nodes and propagate the appropriate topology to the guest.
The first part of this is in the patches I sent recently to the list (PV
support is following, bells and whistles like automatic placement will
follow, too).

 ***Advantages***:
- The guest OS has better means to deal with the NUMA setup; it can more
easily migrate _processes_ among the nodes (Xen-HV can only migrate
whole domains).
- Changes to Xen are relatively small.
- There is no limit for the guest resources, since a guest can use more
resources than are available on one node.
- If guests are well spread over the nodes, the system stays more
balanced even if guests are destroyed and created later.

 ***Disadvantages***:
- The guest has to support NUMA. This is not true for older guests
(Win2K, older Linux).
- The guest's workload has to fit NUMA. If the guest's tasks are not
easily parallelizable or use much shared memory, they cannot take
advantage of NUMA and will degrade in performance. This includes all
single-task problems.

In general this approach seems to fit better with smaller NUMA nodes and
larger guests.

2.) Dynamic load balancing and page migration: create guests within one
NUMA node and distribute all guests across the nodes. If the system
becomes imbalanced, migrate guests to other nodes and copy (at least
part of) their memory pages to the other node's local memory.

 ***Advantages***:
- No guest NUMA support necessary. Older as well as recent guests should
run fine.
- Smaller guests don't have to cope with NUMA and will have 'flat'
memory available.
- Guests running on separate nodes usually don't disturb each other and
can benefit from the higher distributed memory bandwidth.

 ***Disadvantages***:
- Guests are limited to the resources available on one node. This
applies to both the number of CPUs and the amount of memory.
- Costly migration of guests. In a simple implementation we'd use live
migration, which requires the whole guest's memory to be copied before
the guest starts to run on the other node. If this whole move proves to
be unnecessary a few minutes later, all this was in vain. A more
advanced implementation would do the page migration in the background
and thus can avoid this problem, if only the hot pages are migrated
first.
- Integration into Xen seems to be more complicated (at least for the
more ungifted hackers among us).

This approach seems to be more reasonable if you have larger nodes (for
instance 16 cores) and smaller guests (the more usual case nowadays?).

After some discussion we came to the conclusion that both approaches
should be implemented. I want to put this to the list and am looking
forward to any feedback.

Regards,
Andre.

--
Andre Przywara
AMD - Operating System Research Center (OSRC), Dresden, Germany
Hi, Andre, Anthony and all,

>Anthony Xu and I have had some fruitful discussion about the further
>direction of the NUMA support in Xen. I wanted to share the results with
>the Xen community and start a discussion.

Thank you for sharing this.

>We came up with two different approaches for better NUMA support in Xen:
>1.) Guest NUMA support: spread a guest's resources (CPUs and memory)
>over several nodes and propagate the appropriate topology to the guest.
>The first part of this is in the patches I sent recently to the list (PV
>support is following, bells and whistles like automatic placement will
>follow, too).
> ***Advantages***:
>- The guest OS has better means to deal with the NUMA setup; it can more
>easily migrate _processes_ among the nodes (Xen-HV can only migrate
>whole domains).
>- Changes to Xen are relatively small.
>- There is no limit for the guest resources, since a guest can use more
>resources than are available on one node.
>- If guests are well spread over the nodes, the system stays more
>balanced even if guests are destroyed and created later.
> ***Disadvantages***:
>- The guest has to support NUMA. This is not true for older guests
>(Win2K, older Linux).
>- The guest's workload has to fit NUMA. If the guest's tasks are not
>easily parallelizable or use much shared memory, they cannot take
>advantage of NUMA and will degrade in performance. This includes all
>single-task problems.

We may need to write something about guest NUMA in the guest
configuration file. For example:

vnode = <number of guest nodes>
vcpu = [<vcpus# pinned to the node: machine node#>, ...]
memory = [<amount of memory per node: machine node#>, ...]

e.g.
vnode = 2
vcpu = [0-1:0, 2-3:1]
memory = [128:0, 128:1]

If we set up vnode=1, old OSes should work fine.

Also, most OSes read the NUMA configuration only at boot time and at
CPU/memory hotplug. So if Xen migrates a vcpu, Xen has to raise a
hotplug event. That is costly, so pinning vcpus to nodes may be good.
In this case we may also need something about cap/weight.

>In general this approach seems to fit better with smaller NUMA nodes and
>larger guests.
>
>2.) Dynamic load balancing and page migration: create guests within one
>NUMA node and distribute all guests across the nodes. If the system
>becomes imbalanced, migrate guests to other nodes and copy (at least
>part of) their memory pages to the other node's local memory.
> ***Advantages***:
>- No guest NUMA support necessary. Older as well as recent guests should
>run fine.
>- Smaller guests don't have to cope with NUMA and will have 'flat'
>memory available.
>- Guests running on separate nodes usually don't disturb each other and
>can benefit from the higher distributed memory bandwidth.
> ***Disadvantages***:
>- Guests are limited to the resources available on one node. This
>applies to both the number of CPUs and the amount of memory.
>- Costly migration of guests. In a simple implementation we'd use live
>migration, which requires the whole guest's memory to be copied before
>the guest starts to run on the other node. If this whole move proves to
>be unnecessary a few minutes later, all this was in vain. A more
>advanced implementation would do the page migration in the background
>and thus can avoid this problem, if only the hot pages are migrated first.
>- Integration into Xen seems to be more complicated (at least for the
>more ungifted hackers among us).
>
>This approach seems to be more reasonable if you have larger nodes (for
>instance 16 cores) and smaller guests (the more usual case nowadays?)

If Xen migrates a guest, does the system need to have twice the memory
of the guest?

I think basically pinning a guest to a node is good. If the system
becomes imbalanced, and we absolutely want to migrate a guest, then Xen
could temporarily migrate only the vcpus, and we accept the loss of
performance at that time. What do you think?

Best Regards,

Akio Takebe
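As a concrete illustration of the per-node syntax Akio proposes above,
here is a minimal parsing sketch in Python (in the spirit of the
xm/xend tooling). The vnode/vcpu/memory option names and their
"vcpus:node" / "MB:node" format are taken from the example in that
mail; they are hypothetical, not existing xm options:

    # Sketch: turn the proposed per-node strings into plain mappings.
    #   vcpu   = ["0-1:0", "2-3:1"]  ->  {0: [0, 1], 1: [2, 3]}
    #   memory = ["128:0", "128:1"]  ->  {0: 128, 1: 128}

    def parse_vcpu_map(specs):
        """Machine node number -> list of guest vCPU numbers."""
        node_to_vcpus = {}
        for spec in specs:
            vcpus, node = spec.split(":")
            if "-" in vcpus:
                lo, hi = [int(x) for x in vcpus.split("-")]
                cpus = range(lo, hi + 1)
            else:
                cpus = [int(vcpus)]
            node_to_vcpus.setdefault(int(node), []).extend(cpus)
        return node_to_vcpus

    def parse_memory_map(specs):
        """Machine node number -> amount of memory (MB) on that node."""
        node_to_mem = {}
        for spec in specs:
            mem, node = spec.split(":")
            node_to_mem[int(node)] = node_to_mem.get(int(node), 0) + int(mem)
        return node_to_mem

    # Example from the mail above:
    #   parse_vcpu_map(["0-1:0", "2-3:1"])   -> {0: [0, 1], 1: [2, 3]}
    #   parse_memory_map(["128:0", "128:1"]) -> {0: 128, 1: 128}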
>We may need to write something about guest NUMA in the guest
>configuration file. For example:
>vnode = <number of guest nodes>
>vcpu = [<vcpus# pinned to the node: machine node#>, ...]
>memory = [<amount of memory per node: machine node#>, ...]
>
>e.g.
>vnode = 2
>vcpu = [0-1:0, 2-3:1]
>memory = [128:0, 128:1]
>
>If we set up vnode=1, old OSes should work fine.

This is something we need to do. But if the user forgets to configure
guest NUMA in the guest configuration file, Xen needs to provide an
optimized guest NUMA configuration based on the current workload on the
physical machine. We need to provide both; the user configuration can
override the default configuration.

>Also, most OSes read the NUMA configuration only at boot time and at
>CPU/memory hotplug. So if Xen migrates a vcpu, Xen has to raise a
>hotplug event.

The guest should not know about the vcpu migration, so Xen doesn't
trigger a hotplug event for the guest.

Maybe we should not call it vcpu migration; we can call it vnode
migration. Xen (maybe a dom0 application) needs to migrate a vnode
(including its vcpus and memory) from one physical node to another
physical node. The guest NUMA topology is not changed, so Xen doesn't
need to inform the guest of the vnode migration.

>That is costly, so pinning vcpus to nodes may be good.

Agree.

>I think basically pinning a guest to a node is good. If the system
>becomes imbalanced, and we absolutely want to migrate a guest, then Xen
>could temporarily migrate only the vcpus, and we accept the loss of
>performance at that time.

As I mentioned above, it is not a temporary migration. And it will not
impact performance (it may impact performance only during the process
of the vnode migration).

And I think imbalance is rare in a VMM if the user doesn't create and
destroy domains frequently. And there are far fewer VMs on a VMM than
applications on a machine.

- Anthony
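To make the vnode migration Anthony describes a bit more concrete, here
is a rough dom0-side sketch in Python. All of the helpers used below
(vnode_vcpus, pin_vcpu, node_cpus, list_vnode_pages, page_hotness,
migrate_page_to_node) are invented placeholders; no such interfaces
exist in Xen today:

    # Hypothetical vnode migration driven from dom0: re-pin the vnode's
    # vCPUs onto the destination node, then move its pages in the
    # background, hottest first, so the guest keeps running and only
    # pays a temporary remote-access penalty while pages are copied.

    def migrate_vnode(domid, vnode, dst_node):
        # 1. Re-pin the vnode's vCPUs onto the destination node's pCPUs.
        for vcpu in vnode_vcpus(domid, vnode):
            pin_vcpu(domid, vcpu, node_cpus(dst_node))

        # 2. Copy the vnode's memory over, hot pages first.
        pages = list_vnode_pages(domid, vnode)
        for mfn in sorted(pages, key=page_hotness, reverse=True):
            migrate_page_to_node(domid, mfn, dst_node)

        # The guest-visible NUMA topology is unchanged, so no hotplug
        # event needs to be delivered to the guest.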
Hi, Anthony,

>>We may need to write something about guest NUMA in the guest
>>configuration file. [...]
>>If we set up vnode=1, old OSes should work fine.
>
>This is something we need to do. But if the user forgets to configure
>guest NUMA in the guest configuration file, Xen needs to provide an
>optimized guest NUMA configuration based on the current workload on the
>physical machine. We need to provide both; the user configuration can
>override the default configuration.

It sounds good.

>>Also, most OSes read the NUMA configuration only at boot time and at
>>CPU/memory hotplug. So if Xen migrates a vcpu, Xen has to raise a
>>hotplug event.
>
>The guest should not know about the vcpu migration, so Xen doesn't
>trigger a hotplug event for the guest.
>
>Maybe we should not call it vcpu migration; we can call it vnode
>migration. Xen (maybe a dom0 application) needs to migrate a vnode
>(including its vcpus and memory) from one physical node to another
>physical node. The guest NUMA topology is not changed, so Xen doesn't
>need to inform the guest of the vnode migration.

I got it. It sounds cool.

>And I think imbalance is rare in a VMM if the user doesn't create and
>destroy domains frequently. And there are far fewer VMs on a VMM than
>applications on a machine.

I also think so.

Best Regards,

Akio Takebe
> >We may need to write something about guest NUMA in the guest
> >configuration file. For example:
> >vnode = <number of guest nodes>
> >vcpu = [<vcpus# pinned to the node: machine node#>, ...]
> >memory = [<amount of memory per node: machine node#>, ...]
> >
> >e.g.
> >vnode = 2
> >vcpu = [0-1:0, 2-3:1]
> >memory = [128:0, 128:1]
> >
> >If we set up vnode=1, old OSes should work fine.

We need to think carefully about NUMA use cases before implementing a
bunch of mechanism.

The way I see it, in most situations it will not make sense for guests
to span NUMA nodes: you'll have a number of guests with relatively small
numbers of vCPUs, and it probably makes sense to allow the guests to be
pinned to nodes. What we have in Xen today works pretty well for this
case, but we could make configuration easier by looking at more
sophisticated mechanisms for specifying CPU groups rather than just
pinning. Migration between nodes could be handled with a localhost
migrate, but we could obviously come up with something more time/space
efficient (particularly for HVM guests) if required.

There may be some usage scenarios where having a large SMP guest that
spans multiple nodes would be desirable. However, there's a bunch of
scalability work that's required in Xen before this will really make
sense, and all of this is much higher priority (and more generally
useful) than figuring out how to expose NUMA topology to guests. I'd
definitely encourage looking at the guest scalability issues first.

Thanks,
Ian
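For reference, the pinning Ian describes can already be expressed in
today's domain configuration file. A minimal sketch, assuming that on
the machine in question physical CPUs 0-3 happen to belong to node 0
(that mapping is an assumption about the particular box, not something
Xen guarantees):

    # Domain config sketch: restrict the guest's vCPUs to node 0's
    # physical CPUs (assumed here to be CPUs 0-3), so that memory is
    # also allocated from that node at domain creation time.
    name   = "guest1"
    memory = 1024
    vcpus  = 2
    cpus   = "0-3"      # pin vCPUs to node 0's pCPUs

Rebalancing via localhost migrate would then be something like
"xm migrate --live guest1 localhost", as Ian mentions.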
Hi Ian,

Ian Pratt wrote: [Tue Sep 18 2007, 04:43:24AM EDT]
> However, there's a bunch of scalability work that's required in Xen
> before this will really make sense, and all of this is much higher
> priority (and more generally useful) than figuring out how to expose
> NUMA topology to guests.

Could you elaborate on this sentence? What are you thinking of?

Thanks,
Aron
Hi Andre,

Andre Przywara wrote: [Fri Sep 14 2007, 08:05:26AM EDT]
> We came up with two different approaches for better NUMA support in Xen:
> 1.) Guest NUMA support: spread a guest's resources (CPUs and memory) over
> several nodes and propagate the appropriate topology to the guest.
> The first part of this is in the patches I sent recently to the list (PV
> support is following, bells and whistles like automatic placement will
> follow, too).

It seems like you are proposing two things at once here. Let's call
these 1a and 1b:

1a. Expose NUMA topology to the guests. This isn't the topology of
    dom0, just the topology of the domU, i.e. it is constructed by
    dom0 when starting the domain.

1b. Spread the guest over nodes. I can't tell if you mean to do this
    automatically or by request when starting the guest. This seems
    to be separate from 1a.

> ***Advantages***:
> - The guest OS has better means to deal with the NUMA setup; it can more
> easily migrate _processes_ among the nodes (Xen-HV can only migrate whole
> domains).
> - Changes to Xen are relatively small.
> - There is no limit for the guest resources, since a guest can use more
> resources than are available on one node.

The advantages above relate to 1a,

> - If guests are well spread over the nodes, the system stays more balanced
> even if guests are destroyed and created later.

and this advantage relates to 1b.

> ***Disadvantages***:
> - The guest has to support NUMA. This is not true for older guests (Win2K,
> older Linux).
> - The guest's workload has to fit NUMA. If the guest's tasks are not easily
> parallelizable or use much shared memory, they cannot take advantage of
> NUMA and will degrade in performance. This includes all single-task
> problems.

IMHO the list of disadvantages is only what we have in Xen today.
Presently no guests can see the NUMA topology, so it's the same as if
they don't have support in the guest. Adding NUMA topology propagation
does not create these disadvantages, it simply exposes the weakness of
the lesser operating systems.

> In general this approach seems to fit better with smaller NUMA nodes
> and larger guests.
>
> 2.) Dynamic load balancing and page migration: create guests within one
> NUMA node and distribute all guests across the nodes. If the system becomes
> imbalanced, migrate guests to other nodes and copy (at least part of) their
> memory pages to the other node's local memory.

Again, this seems like a two-part proposal.

2a. Add to Xen the ability to run a guest within a node, so that cpus
    and ram are allocated from within the node instead of randomly
    across the system.

2b. NUMA balancing. While this seems like a worthwhile goal, IMHO
    it's separate from the first part of the proposal.

> ***Advantages***:
> - No guest NUMA support necessary. Older as well as recent guests should run
> fine.
> - Smaller guests don't have to cope with NUMA and will have 'flat' memory
> available.
> - Guests running on separate nodes usually don't disturb each other and can
> benefit from the higher distributed memory bandwidth.
> ***Disadvantages***:
> - Guests are limited to the resources available on one node. This applies
> to both the number of CPUs and the amount of memory.

Advantages and disadvantages above apply to 2a.

> - Costly migration of guests. In a simple implementation we'd use live
> migration, which requires the whole guest's memory to be copied before the
> guest starts to run on the other node. If this whole move proves to be
> unnecessary a few minutes later, all this was in vain.
> A more advanced
> implementation would do the page migration in the background and thus can
> avoid this problem, if only the hot pages are migrated first.

This applies to 2b.

> - Integration into Xen seems to be more complicated (at least for the more
> ungifted hackers among us).

It seems like 2a would be significantly easier than 2b.

If the mechanics of migrating between NUMA nodes is implemented in the
hypervisor, then policy and control can be implemented in dom0
userland, so none of the automatic part of this needs to be in the
hypervisor.

Thanks for getting started on this. Getting some of this into Xen
would be great!

Aron
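To illustrate the mechanism/policy split Aron suggests, here is a small
sketch in Python of what a dom0 userland balancing policy could look
like. list_nodes, node_load, domains_on_node, domain_memory and
move_domain_to_node are invented placeholders for whatever interfaces
the hypervisor and tools would actually expose:

    # Hypothetical dom0 rebalancer: all policy lives here in userland;
    # the hypervisor only has to supply the move_domain_to_node
    # mechanism (e.g. node-to-node page migration).
    import time

    IMBALANCE_THRESHOLD = 0.25   # arbitrary example value

    def rebalance_once(nodes):
        load = dict((n, node_load(n)) for n in nodes)
        busiest = max(nodes, key=load.get)
        idlest = min(nodes, key=load.get)
        if load[busiest] - load[idlest] < IMBALANCE_THRESHOLD:
            return   # balanced enough, do nothing
        # Pick the smallest domain on the busy node to keep the move cheap.
        victim = min(domains_on_node(busiest), key=domain_memory)
        move_domain_to_node(victim, idlest)

    while True:
        rebalance_once(list_nodes())
        time.sleep(60)   # imbalance is rare, so poll slowly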
> Ian Pratt wrote: [Tue Sep 18 2007, 04:43:24AM EDT]
> > However, there's a bunch of scalability work that's required in Xen
> > before this will really make sense, and all of this is much higher
> > priority (and more generally useful) than figuring out how to expose
> > NUMA topology to guests.
>
> Could you elaborate on this sentence? What are you thinking of?

Eliminating the need to hold the domain lock for pagetable updates for
PV guests would certainly be a guest scalability win. The page ref
counting is designed to make this possible. Similarly, HVM guests would
benefit from optimizing the amount of time the shadow lock is held for
(eliminating it altogether is harder). [NB: per-VCPU shadows is one
strategy for removing the lock, but it brings with it a whole host of
other synchronization issues that make me strongly suspect it's not
worth it.]

Xen's CPU scheduler could certainly do with some improvements when it
comes to multiple multi-CPU guests. We probably want behaviour that
tends towards gang scheduling yet remains work conserving. We also need
some kind of "bad pre-emption" avoidance or mitigation strategy to avoid
other VCPUs spinning waiting for a VCPU that isn't running. We could
implement avoidance by giving a VCPU some extra time if it's holding a
lock at the point we pre-empt it, and mitigation by doing a directed
yield in the lock slow path if the VCPU holding the lock is not running.
Both require a new paravirtualization extension to the OS.

Ian
Ian Pratt wrote:
>We need to think carefully about NUMA use cases before implementing a
>bunch of mechanism.

Agree, that's why we posted this thread; we hope we can get enough
input.

>The way I see it, in most situations it will not make sense for guests
>to span NUMA nodes: you'll have a number of guests with relatively small
>numbers of vCPUs, and it probably makes sense to allow the guests to be
>pinned to nodes. What we have in Xen today works pretty well for this
>case, but we could make configuration easier by looking at more
>sophisticated mechanisms for specifying CPU groups rather than just
>pinning. Migration between nodes could be handled with a localhost
>migrate, but we could obviously come up with something more time/space
>efficient (particularly for HVM guests) if required.
>
>There may be some usage scenarios where having a large SMP guest that
>spans multiple nodes would be desirable. However, there's a bunch of
>scalability work that's required in Xen before this will really make
>sense, and all of this is much higher priority (and more generally
>useful) than figuring out how to expose NUMA topology to guests. I'd
>definitely encourage looking at the guest scalability issues first.

What you have said may be true: many guests have small numbers of
vCPUs. In this situation, we need to pin a guest to a node for good
performance. Pinning guests to nodes may lead to imbalance after some
creating and destroying of guests. We also need to handle that
imbalance, so better host NUMA support is needed.

Even if we don't have big guests, we may also need to let a guest span
NUMA nodes. For example, when we create a guest with a lot of memory
and none of the NUMA nodes can satisfy the memory request, this guest
has to span NUMA nodes. We then need to provide the guest with the NUMA
information.

There are also very small NUMA nodes, maybe one CPU per node. If a
guest has two vCPUs, we need to provide the guest with NUMA
information, otherwise it will impact performance badly.

- Anthony
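One way to picture the default placement Anthony describes: fit the
guest into a single node when possible, and only fall back to spanning
nodes (and therefore to exposing a multi-node topology to the guest)
when no node has enough free memory. A sketch in Python, where
free_memory_on_node is again an invented placeholder:

    # Hypothetical default-placement helper: prefer a single node
    # (the vnode = 1 case), span several nodes only when the guest
    # cannot fit into any one of them.

    def choose_nodes(guest_mem, nodes):
        free = dict((n, free_memory_on_node(n)) for n in nodes)
        # Prefer any single node that can hold the whole guest.
        fitting = [n for n in nodes if free[n] >= guest_mem]
        if fitting:
            return [max(fitting, key=free.get)]    # vnode = 1
        # Otherwise span nodes, emptiest first, until the request fits.
        chosen, remaining = [], guest_mem
        for n in sorted(nodes, key=free.get, reverse=True):
            chosen.append(n)
            remaining -= free[n]
            if remaining <= 0:
                break
        return chosen    # one guest vnode per entry, vnode > 1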
Ian Pratt wrote: [Tue Sep 18 2007, 04:43:24AM EDT]
> The way I see it, in most situations it will not make sense for guests
> to span NUMA nodes: you'll have a number of guests with relatively small
> numbers of vCPUs, and it probably makes sense to allow the guests to be
> pinned to nodes. What we have in Xen today works pretty well for this
> case, [snip]

One part that doesn't work well presently is memory locality. A guest
can be pinned to a CPU but its allocated memory might be on a distant
node...

Aron
> Ian Pratt wrote: [Tue Sep 18 2007, 04:43:24AM EDT]
> > The way I see it, in most situations it will not make sense for guests
> > to span NUMA nodes: you'll have a number of guests with relatively small
> > numbers of vCPUs, and it probably makes sense to allow the guests to be
> > pinned to nodes. What we have in Xen today works pretty well for this
> > case, [snip]
>
> One part that doesn't work well presently is memory locality. A guest
> can be pinned to a CPU but its allocated memory might be on a distant
> node...

If the guest's VCPUs were pinned to a particular node (or nodes) at the
time the domain was created or additional memory was allocated to it,
then the memory *will* be allocated from the right node(s). If you have
a domain CPU mask that covers multiple nodes, memory will be 'striped'
across the nodes.

Ian
> >There may be some usage scenarios where having a large SMP guest that
> >spans multiple nodes would be desirable. However, there's a bunch of
> >scalability work that's required in Xen before this will really make
> >sense, and all of this is much higher priority (and more generally
> >useful) than figuring out how to expose NUMA topology to guests. I'd
> >definitely encourage looking at the guest scalability issues first.
>
> What you have said may be true: many guests have small numbers of
> vCPUs. In this situation, we need to pin a guest to a node for good
> performance. Pinning guests to nodes may lead to imbalance after some
> creating and destroying of guests. We also need to handle that
> imbalance, so better host NUMA support is needed.

Localhost relocate is a crude way of doing this rebalancing today.
Sure, we can do better, but it's a solution.

> Even if we don't have big guests, we may also need to let a guest span
> NUMA nodes. For example, when we create a guest with a lot of memory
> and none of the NUMA nodes can satisfy the memory request, this guest
> has to span NUMA nodes. We then need to provide the guest with the NUMA
> information.

In that far-from-optimal situation you'll likely want to try and
rebalance things at some point later. Since no guest OS I'm aware of
understands dynamic NUMA information, I seriously doubt any good can
come from telling it about the temporary situation.

> There are also very small NUMA nodes, maybe one CPU per node. If a
> guest has two vCPUs, we need to provide the guest with NUMA
> information, otherwise it will impact performance badly.

Ian
Hi Aron,

>> 1.) Guest NUMA support: spread a guest's resources (CPUs and memory)
>> over several nodes and propagate the appropriate topology to the
>> guest. ...
>It seems like you are proposing two things at once here. Let's call
>these 1a and 1b:
>1a. Expose NUMA topology to the guests. This isn't the topology of
>    dom0, just the topology of the domU, i.e. it is constructed by
>    dom0 when starting the domain.
>1b. Spread the guest over nodes. I can't tell if you mean to do this
>    automatically or by request when starting the guest. This seems
>    to be separate from 1a.

From an implementation point of view this is right. If you look at the
patches I sent in mid-August, those parts are done in separate patches:
http://lists.xensource.com/archives/html/xen-devel/2007-08/msg00275.html
Patch 3/4 cares about 1b), patch 4/4 is about 1a).

But both parts do not make much sense if done separately. If you spread
the guest over several nodes and don't tell the guest OS about it, you
will get about the same behaviour Xen had before the integration of the
basic NUMA patches from Ryan Harper in October 2006.

>> ***Disadvantages***:
>> - The guest has to support NUMA...
>> - The guest's workload has to fit NUMA...
>IMHO the list of disadvantages is only what we have in Xen today.
>Presently no guests can see the NUMA topology, so it's the same as if
>they don't have support in the guest. Adding NUMA topology propagation
>does not create these disadvantages, it simply exposes the weakness of
>the lesser operating systems.

This was mostly meant as disadvantages compared with solution 2).

>> 2.) Dynamic load balancing and page migration:
>Again, this seems like a two-part proposal.
>2a. Add to Xen the ability to run a guest within a node, so that cpus
>    and ram are allocated from within the node instead of randomly
>    across the system.

This is already in Xen, at least if you pin the guest manually to a
certain node _before_ creating the guest (by saying for instance
cpus=0,1 if the first node consists of the first two CPUs). Xen will
try to allocate the guest's memory from within the node the first VCPU
is currently scheduled on (at least for HVM guests).

>2b. NUMA balancing. While this seems like a worthwhile goal, IMHO
>    it's separate from the first part of the proposal.

This is most of the work that has to be done.

> If the mechanics of migrating between NUMA nodes is implemented in the
> hypervisor, then policy and control can be implemented in dom0
> userland, so none of the automatic part of this needs to be in the
> hypervisor.

This may be true; at least there should be some means to manually
migrate domains between nodes, which must be triggered from Dom0. So
automatic behaviour could be triggered from there, too.

Andre.

--
Andre Przywara
AMD - Operating System Research Center, Dresden, Germany
Ian Pratt wrote: [Thu Sep 20 2007, 05:50:26AM EDT]
> If the guest's VCPUs were pinned to a particular node (or nodes) at
> the time the domain was created or additional memory was allocated
> to it, then the memory *will* be allocated from the right node(s).
> If you have a domain CPU mask that covers multiple nodes, memory
> will be 'striped' across the nodes.

Thanks, I didn't realize that much is already working.

Aron