Andre Przywara
2007-Aug-13 10:01 UTC
[Xen-devel] [PATCH 0/4] [HVM] NUMA support in HVM guests
Hi,

these four patches allow forwarding NUMA characteristics into HVM guests. This works by allocating memory explicitly from different NUMA nodes and creating an appropriate SRAT ACPI table which describes the topology. It needs a decent guest kernel which uses the SRAT table to discover the NUMA topology.
This breaks the current de-facto limitation of guests to one NUMA node: one can use more memory and/or more VCPUs than are available on a single node.

 Patch 1/4: introduce the numanodes=n config file option.
This states how many NUMA nodes the guest should see; the default is 0, which turns off most parts of the code.
 Patch 2/4: introduce CPU affinity for the allocate_physmap call.
Currently the NUMA node to take the memory from is chosen by simply using the currently scheduled CPU; this patch allows a CPU to be specified explicitly and provides XENMEM_DEFAULT_CPU for the old behavior.
 Patch 3/4: allocate memory with NUMA in mind.
Actually look at the numanodes=n option to split the memory request into n parts and allocate each part from a different node. Also change the VCPUs' affinity to match the nodes.
 Patch 4/4: inject the created SRAT table into the guest.
Create a SRAT table, fill it with the desired NUMA topology and inject it into the guest.

Applies against staging c/s #15719.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
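[A rough, standalone sketch of the per-node split described for patch 3/4: the guest memory request is divided evenly over numanodes nodes and any allocation failure aborts domain creation. This is illustration only, not code from the series; populate_physmap_on_node() is a hypothetical stand-in for the node-aware allocation call the patches add to libxc.]

#include <stdio.h>

/* hypothetical node-aware allocator: returns 0 on success */
static int populate_physmap_on_node(unsigned long nr_pages, int node)
{
    printf("allocating %lu pages from node %d\n", nr_pages, node);
    return 0;
}

/* split nr_pages evenly over numanodes nodes, as patch 3/4 describes */
static int setup_numa_mem(unsigned long nr_pages, int numanodes)
{
    unsigned long per_node = nr_pages / numanodes;
    unsigned long remainder = nr_pages % numanodes;

    for (int node = 0; node < numanodes; node++) {
        /* give any remainder to the last node so the total matches */
        unsigned long chunk = per_node + (node == numanodes - 1 ? remainder : 0);
        if (populate_physmap_on_node(chunk, node))
            return -1;   /* any failure aborts domain creation */
    }
    return 0;
}

int main(void)
{
    return setup_numa_mem(262144, 2);   /* e.g. 1 GB of 4k pages over 2 nodes */
}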
Xu, Anthony
2007-Sep-07 08:42 UTC
RE: [Xen-devel] [PATCH 0/4] [HVM] NUMA support in HVM guests
Hi Andre,

This is a good start for supporting guest NUMA. I have some comments.

+    for (i=0;i<=dominfo.max_vcpu_id;i++)
+    {
+        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
+        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
+    }

This always starts from node 0, which may make node 0 very busy while the other nodes may not have much work. It may be nicer to start pinning from the most lightly loaded node.

We also need to add some limitations for numanodes. The number of VCPUs on a vnode should not be larger than the number of PCPUs on the pnode; otherwise VCPUs belonging to a domain run on the same PCPU, which is not what we want.

In setup_numa_mem, every node gets the same memory size, and if the memory allocation fails, the domain creation fails. This may be too "rude". I think we can support guest NUMA where each node has a different memory size, and maybe some nodes have no memory at all. What we need to guarantee is that the guest sees the physical topology.

In your patch, when a NUMA guest is created, each vnode is pinned to a pnode. After some domain creations and destructions the workload on the platform may become very imbalanced, so we need a method to dynamically balance the workload. There are two methods IMO:
1. Implement a NUMA-aware scheduler and page migration.
2. Run a daemon in dom0 that monitors the workload and uses live migration to balance it if necessary.

Regards
-Anthony
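[A minimal sketch of the limitation suggested above: reject a numanodes value that would put more VCPUs on a virtual node than the corresponding physical node has CPUs. The counts are plain parameters here; in a real toolstack they would come from the domain config and from the physical topology information exposed by Xen.]

#include <stdio.h>

static int check_numanodes(unsigned int nr_vcpus, unsigned int numanodes,
                           unsigned int pcpus_per_pnode)
{
    /* VCPUs are spread evenly over the vnodes, so round up the share */
    unsigned int vcpus_per_vnode = (nr_vcpus + numanodes - 1) / numanodes;

    if (vcpus_per_vnode > pcpus_per_pnode) {
        fprintf(stderr,
                "numanodes=%u puts %u VCPUs on a node with only %u PCPUs\n",
                numanodes, vcpus_per_vnode, pcpus_per_pnode);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* e.g. 8 VCPUs over 2 vnodes with 4 PCPUs per physical node: fine */
    return check_numanodes(8, 2, 4);
}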
Andre Przywara
2007-Sep-07 12:49 UTC
Re: [Xen-devel] [PATCH 0/4] [HVM] NUMA support in HVM guests
Anthony,
thanks for looking into the patches, I appreciate your comments.

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1);
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }
>
> This always starts from node 0, which may make node 0 very busy while
> the other nodes may not have much work.
This is true, I encountered this before, but didn't want to wait longer before sending out the patches. Actually the "numanodes=n" config file option shouldn't specify the number of nodes, but a list of specific nodes to use, like "numanodes=0,2" to pin the domain to the first and the third node.

> It may be nicer to start pinning from the most lightly loaded node.
This sounds interesting. It shouldn't be that hard to do this in libxc, but we should think about a semantic to specify this behavior in the config file (if we change the semantics from a node count to specific nodes as described above).

> We also need to add some limitations for numanodes. The number of VCPUs
> on a vnode should not be larger than the number of PCPUs on the pnode;
> otherwise VCPUs belonging to a domain run on the same PCPU, which is
> not what we want.
Would be nice, but for the moment I would push this into the sysadmin's responsibility.

> In setup_numa_mem, every node gets the same memory size, and if the
> memory allocation fails, the domain creation fails. This may be too
> "rude". I think we can support guest NUMA where each node has a
> different memory size, and maybe some nodes have no memory at all.
> What we need to guarantee is that the guest sees the physical topology.
Sounds reasonable. I will look into this.

> In your patch, when a NUMA guest is created, each vnode is pinned to a
> pnode. After some domain creations and destructions the workload on the
> platform may become very imbalanced, so we need a method to dynamically
> balance the workload. There are two methods IMO:
> 1. Implement a NUMA-aware scheduler and page migration.
> 2. Run a daemon in dom0 that monitors the workload and uses live
>    migration to balance it if necessary.
You are right, this may become a problem. I think the second solution is easier to implement. A NUMA-aware scheduler would be nice, but my idea was that the guest OS can schedule things better (more fine-grained, on a per-process basis rather than a per-machine basis). Changing the processing node without moving the memory along should be an exception (it changes the NUMA topology, and at the moment I don't see a method to propagate this nicely to the (HVM) guest), so I think a kind of "real-emergency balancer" which includes page migration (quite expensive with bigger memory sizes!) would be more appropriate.

After all, my patches were more a basis for discussion than a final solution, so I see there is more work to do. At the moment I am working on including PV guests.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
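[A small sketch of the config change discussed above: the VCPU-pinning loop takes an explicit list of node numbers (e.g. parsed from "numanodes=0,2") instead of always using nodes 0..n-1. Illustration only; set_vcpu_affinity_to_node() is a hypothetical stand-in for the xc_vcpu_setaffinity() call with a per-node CPU mask.]

#include <stdio.h>

/* hypothetical wrapper around xc_vcpu_setaffinity() with a node's CPU mask */
static void set_vcpu_affinity_to_node(int vcpu, int node)
{
    printf("pinning VCPU %d to node %d\n", vcpu, node);
}

static void pin_vcpus_to_nodes(int max_vcpu_id, const int *nodes, int nr_nodes)
{
    for (int i = 0; i <= max_vcpu_id; i++) {
        /* same even spread as in the posted patch, but over the
         * caller-supplied node list rather than nodes 0..n-1 */
        int node = nodes[(i * nr_nodes) / (max_vcpu_id + 1)];
        set_vcpu_affinity_to_node(i, node);
    }
}

int main(void)
{
    int nodes[] = { 0, 2 };          /* e.g. from numanodes=0,2 */
    pin_vcpus_to_nodes(3, nodes, 2); /* 4 VCPUs spread over nodes 0 and 2 */
    return 0;
}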
Xu, Anthony
2007-Sep-10 01:14 UTC
RE: [Xen-devel] [PATCH 0/4] [HVM] NUMA support in HVM guests
Andre,

>> This always starts from node 0, which may make node 0 very busy while
>> the other nodes may not have much work.
> This is true, I encountered this before, but didn't want to wait longer
> before sending out the patches. Actually the "numanodes=n" config file
> option shouldn't specify the number of nodes, but a list of specific
> nodes to use, like "numanodes=0,2" to pin the domain to the first and
> the third node.
That's a good idea, to specify the nodes to use. We can use "numanodes=0,2" in the config file, and it will be converted into a bitmap (a long), where every bit indicates one node. When the guest doesn't specify "numanodes", Xen will need to choose proper nodes for the guest, so Xen also needs to implement some algorithm to choose proper nodes.

>> We also need to add some limitations for numanodes. The number of VCPUs
>> on a vnode should not be larger than the number of PCPUs on the pnode;
>> otherwise VCPUs belonging to a domain run on the same PCPU, which is
>> not what we want.
> Would be nice, but for the moment I would push this into the sysadmin's
> responsibility.
It's reasonable.

> After all, my patches were more a basis for discussion than a final
> solution, so I see there is more work to do. At the moment I am working
> on including PV guests.
That's a very good start for supporting guest NUMA.

Regards
- Anthony
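[A small sketch of the conversion described above: parse the node list from a config value like "0,2" (from numanodes=0,2) into a bitmap held in an unsigned long, one bit per node. Illustration only, not code from the posted patches.]

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* turn "0,2" into a bitmap with bits 0 and 2 set */
static unsigned long parse_numanodes(const char *value)
{
    unsigned long bitmap = 0;
    char *copy = strdup(value);
    char *tok = strtok(copy, ",");

    while (tok != NULL) {
        int node = atoi(tok);
        if (node >= 0 && node < (int)(8 * sizeof(unsigned long)))
            bitmap |= 1UL << node;   /* set the bit for this node */
        tok = strtok(NULL, ",");
    }
    free(copy);
    return bitmap;
}

int main(void)
{
    printf("node bitmap: 0x%lx\n", parse_numanodes("0,2"));   /* prints 0x5 */
    return 0;
}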
Duan, Ronghui
2007-Nov-23 08:42 UTC
RE: [Xen-devel] [PATCH 0/4] [HVM][RFC] NUMA support in HVM guests
Hi Andre,

I read your patches and Anthony's comments, and wrote a patch based on them:

1: If the guest sets numanodes=n (the default is 1, meaning the guest will be restricted to one node), the hypervisor chooses the starting node to pin this guest to using round robin. But the method I use needs a spin_lock to prevent domains from being created at the same time. Are there any better methods? I hope for your suggestions.

2: Pass the node parameter in the higher bits of the flags when creating a domain. The domain can then record the node information in the domain struct for further use, i.e. to know which node to pin to in setup_guest. With this method, your patch can simply balance the nodes just like below:

> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
> +    {
> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1) +
> +                domaininfo.first_node;
> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
> +    }

BTW: I can't find your mail with Patch 2/4 (introduce CPU affinity for the allocate_physmap call), so I can't apply your patch to the source. I have just begun my "NUMA trip", so I appreciate your suggestions. Thanks.

Best Regards
Ronghui
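[A minimal sketch of the round-robin starting-node selection described in point 1, including the lock that serializes concurrent domain creation. A pthread mutex stands in for the hypervisor's spin_lock, and the node count is a constant here instead of coming from the physical topology.]

#include <pthread.h>
#include <stdio.h>

#define NR_NODES 4

static pthread_mutex_t next_node_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int next_node;           /* shared round-robin state */

static unsigned int pick_start_node(void)
{
    pthread_mutex_lock(&next_node_lock);      /* spin_lock in the hypervisor */
    unsigned int node = next_node;
    next_node = (next_node + 1) % NR_NODES;   /* advance the round robin */
    pthread_mutex_unlock(&next_node_lock);    /* spin_unlock */
    return node;
}

int main(void)
{
    /* three domain creations land on nodes 0, 1, 2 in turn */
    for (int i = 0; i < 3; i++)
        printf("domain %d starts on node %u\n", i, pick_start_node());
    return 0;
}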
All,

thanks Ronghui for your patches and ideas. To take a more structured approach to better NUMA support, I suggest concentrating on one-node guests first:

* Introduce CPU affinity for the memory allocation routines called from Dom0. This is basically my patch 2/4 from August. We should think about using a NUMA node number instead of a physical CPU; is there something to be said against this?
* Find _some_ method of load balancing when creating guests. Method 1 from Ronghui is a start, but a real decision based on each node's utilization (or free memory) would be more reasonable (a sketch of such a choice follows below).
* Patch the guest memory allocation routines to allocate memory from that specific node only (based on my patch 3/4).
* Use live migration to the local host to allow node migration. Assuming that localhost live migration works reliably (is that really true?), it shouldn't be too hard to implement this (basically just using node affinity while allocating guest memory). Since this is a rather expensive operation (it takes twice the memory temporarily and quite some time), I'd suggest triggering it explicitly from the admin via an xm command, maybe as an addition to migrate:
# xm migrate --live --node 1 <domid> localhost
There could be some Dom0-daemon-based re-balancer to do this somewhat automatically later on.

I would take care of the memory allocation patch and would look into node migration. It would be great if Ronghui or Anthony would help to improve the "load balancing" algorithm. Meanwhile I will continue to patch that d*** Linux kernel to accept both CONFIG_NUMA and CONFIG_XEN without crashing that early ;-), this should allow both HVM and PV guests to support multiple NUMA nodes within one guest.

Also we should start a discussion on the config file options to add: shall we use "numanodes=<nr of nodes>", something like "numa=on" (for one-node guests only), or something like "numanode=0,1" to explicitly specify certain nodes? Any comments are appreciated.

> I read your patches and Anthony's comments, and wrote a patch based on
> them:
>
> 1: If the guest sets numanodes=n (the default is 1, meaning the guest
>    will be restricted to one node), the hypervisor chooses the starting
>    node to pin this guest to using round robin. But the method I use
>    needs a spin_lock to prevent domains from being created at the same
>    time. Are there any better methods? I hope for your suggestions.
That's a good start, thank you. Maybe Keir has some comments on the spinlock issue.

> 2: Pass the node parameter in the higher bits of the flags when creating
>    a domain. The domain can then record the node information in the
>    domain struct for further use, i.e. to know which node to pin to in
>    setup_guest. With this method, your patch can simply balance the
>    nodes just like below:
>
>> +    for (i=0;i<=dominfo.max_vcpu_id;i++)
>> +    {
>> +        node= ( i * numanodes ) / (dominfo.max_vcpu_id+1) +
>> +                domaininfo.first_node;
>> +        xc_vcpu_setaffinity (xc_handle, dom, i, nodemasks[node]);
>> +    }
How many bits do you want to use? Maybe it's not a good idea to abuse some variable to hold only a limited number of nodes ("640K ought to be enough for anybody" ;-). But the general idea is good.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 277-84917
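[A sketch of the free-memory-based node choice referenced in the list above: when a guest is created, pick the node with the most free memory instead of always starting at node 0. The free-memory figures are passed in as an array here; in a real toolstack they would be queried from the hypervisor's per-node heap information.]

#include <stdio.h>

static int pick_least_loaded_node(const unsigned long *free_pages, int nr_nodes)
{
    int best = 0;

    for (int node = 1; node < nr_nodes; node++) {
        /* prefer the node with the largest amount of free memory */
        if (free_pages[node] > free_pages[best])
            best = node;
    }
    return best;
}

int main(void)
{
    unsigned long free_pages[] = { 1000, 250000, 180000, 90000 };
    printf("placing guest on node %d\n",
           pick_least_loaded_node(free_pages, 4));  /* prints node 1 */
    return 0;
}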
Hi all,

> thanks Ronghui for your patches and ideas. To take a more structured
> approach to better NUMA support, I suggest concentrating on one-node
> guests first:
That is exactly what we want to do at first: don't support guest NUMA yet.

> * Introduce CPU affinity for the memory allocation routines called from
>   Dom0. This is basically my patch 2/4 from August. We should think
>   about using a NUMA node number instead of a physical CPU; is there
>   something to be said against this?
I think it is reasonable to bind a guest to a node, not to a CPU.

> * Find _some_ method of load balancing when creating guests. Method 1
>   from Ronghui is a start, but a real decision based on each node's
>   utilization (or free memory) would be more reasonable.
Yes, it is only a start for balancing.

> * Patch the guest memory allocation routines to allocate memory from
>   that specific node only (based on my patch 3/4).
Considering the performance, we should do it.

> * Use live migration to the local host to allow node migration. [...]
>
> I would take care of the memory allocation patch and would look into
> node migration. It would be great if Ronghui or Anthony would help to
> improve the "load balancing" algorithm.
I have no ideas on this right now.

> Meanwhile I will continue to patch that d*** Linux kernel to accept both
> CONFIG_NUMA and CONFIG_XEN without crashing that early ;-), this should
> allow both HVM and PV guests to support multiple NUMA nodes within one
> guest.
>
> Also we should start a discussion on the config file options to add:
> shall we use "numanodes=<nr of nodes>", something like "numa=on" (for
> one-node guests only), or something like "numanode=0,1" to explicitly
> specify certain nodes?
Because we don't support guest NUMA now, we don't need these config options yet. If we ever need to support guest NUMA, I think users may even want to configure each node's shape, i.e. how many CPUs or how much memory is in that node. I think it would be too complicated. ^_^

> How many bits do you want to use? Maybe it's not a good idea to abuse
> some variable to hold only a limited number of nodes ("640K ought to be
> enough for anybody" ;-). But the general idea is good.
Actually, if there is no need to support guest NUMA, no parameter needs to be passed down at all. It seems that one node per guest is a good approach. ^_^

Best regards,
Ronghui