Andre Przywara
2010-Feb-04 21:50 UTC
[Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi,

to avoid duplicated work in the community on the same topic, to help us sync on the subject, and because I am out of the office next week, I would like to send the NUMA guest support patches I have so far. These patches introduce NUMA support for guests. This is useful if either the guest's resources (VCPUs and/or memory) exceed one node's capacity, or the host is already loaded such that the request cannot be satisfied from one node alone. Some applications may also benefit from the aggregated bandwidth of multiple memory controllers. Even if the guest has only a single node, this code replaces the current NUMA placement mechanism by moving it into libxc.

I have changed some things recently, so there are some loose ends, but it should suffice as a basis for discussion.

The patches are primarily for HVM guests; as I don't deal much with PV, I am not sure whether a port would be straightforward or more complex. One thing I was not sure about is how to communicate the NUMA topology to PV guests. Reusing the existing code base and injecting a generated ACPI table seems smart, but this would mean enabling ACPI parsing code in PV Linux, which currently seems to be disabled (?). If someone wants to step in and implement PV support, I will be glad to help.

I have reworked the (guest node to) host node assignment part; this is currently unfinished. I decided to move the node-rating part from XendDomainInfo.py:find_relaxed_node() into libxc (should this eventually go into libxenlight?) to avoid passing too much information between the layers and to include libxl support. This code snippet (patch 5/5) basically scans all VCPUs on all domains and generates an array holding the node load metric for future sorting. The missing part here is a static function in xc_hvm_build.c to pick the <n> best nodes and populate the numainfo->guest_to_host_node array with the result. I will do this when I am back.

For more details see the following email bodies.

Thanks and Regards, Andre.

-- Andre Przywara, AMD-Operating System Research Center (OSRC), Dresden, Germany, Tel: +49 351 488-3567-12
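[Editor's note: to make the node-rating idea above concrete, here is a toy, self-contained sketch of the kind of load metric such a libxc helper could compute. All names, parameter shapes, and the weighting are invented for this example and differ from the actual patch 5/5 code, which walks domain/VCPU info via libxc instead of taking plain arrays.]

    /* Sketch: charge every VCPU in the system to the host node(s) its
     * affinity mask allows it to run on. */
    #include <stdint.h>
    #include <stdio.h>

    #define MAX_NODES 8

    static void rate_node_load(unsigned int nr_cpus,
                               const unsigned int cpu_to_node[],  /* per pCPU */
                               unsigned int nr_vcpus,
                               const uint64_t vcpu_affinity[],    /* per vCPU */
                               unsigned int load[MAX_NODES])
    {
        for (unsigned int v = 0; v < nr_vcpus; v++) {
            unsigned int hits[MAX_NODES] = { 0 }, nr_nodes_hit = 0;

            /* which nodes can this vCPU run on? */
            for (unsigned int c = 0; c < nr_cpus && c < 64; c++)
                if ((vcpu_affinity[v] >> c) & 1)
                    if (hits[cpu_to_node[c]]++ == 0)
                        nr_nodes_hit++;

            /* a vCPU confined to fewer nodes loads each of them more;
             * 16 units represent one fully pinned vCPU (fixed point) */
            for (unsigned int n = 0; n < MAX_NODES; n++)
                if (hits[n])
                    load[n] += 16 / nr_nodes_hit;
        }
    }

    int main(void)
    {
        /* toy host: 4 pCPUs on two nodes; one vCPU pinned to node 0,
         * one vCPU allowed to float over both nodes */
        unsigned int cpu_to_node[4] = { 0, 0, 1, 1 };
        uint64_t affinity[2] = { 0x3 /* cpus 0-1 */, 0xf /* cpus 0-3 */ };
        unsigned int load[MAX_NODES] = { 0 };

        rate_node_load(4, cpu_to_node, 2, affinity, load);
        printf("node0=%u node1=%u\n", load[0], load[1]);  /* 24 and 8 */
        return 0;
    }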
Cui, Dexuan
2010-Feb-05 16:35 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi Andre, I'm also looking into hvm guests' numa support, and I'd like to share my thoughts and my understanding of your patches.

1) Besides SRAT, I think we should also build the guest SLIT according to the host SLIT.

2) I agree we should give the user a way to specify how much memory each guest node should have, namely the "nodemem" parameter in your patch02. However, I can't find where it is assigned a value in your patches; I guess you missed it in image.py. And what if xen can't allocate memory from the specified host node (e.g., not enough free memory on that host node)? Currently xen *silently* tries to allocate memory from other host nodes -- this would hurt guest performance while the user doesn't know about it at all! I think we should add an option to the guest config file: if it's set, guest creation should fail if xen cannot allocate memory from the specified host node.

3) In your patch02:

  + for (i = 0; i < numanodes; i++)
  +     numainfo.guest_to_host_node[i] = i % 2;

As you said in the "[PATCH 5/5]" mail, at present it is "simply round robin until the code for automatic allocation is in place". I think "simply round robin" is not acceptable and we should implement the "automatic allocation".

4) Your patches sort the host nodes using a node load evaluation algorithm, require the user to specify how many guest nodes the guest should see, and distribute the guest vcpus equally across the guest nodes. I don't think the algorithm can be wise enough every time, and it's not flexible. Requiring the user to specify the number of guest nodes and distributing the vcpus equally across them also doesn't sound wise or flexible enough. Since guest numa needs vcpu pinning to work as expected, how about my thoughts below?

a) ask the user to use the "cpus" option to pin each vcpu to a physical cpu (or node);
b) find out how many physical nodes (host nodes) are involved and use that number as the number of guest nodes;
c) each guest node corresponds to a host node found in step b); use this info to fill the numainfo.guest_to_host_node[] from 3).

5) I think we also need to present the numa guest with a virtual cpu topology, e.g., through the initial APIC ID. In current xen, apic_id = vcpu_id * 2; even with guest SRAT support and 2 guest nodes for a vcpus=n guest, the guest would still think it is on one package with n cores, without any knowledge of the vcpu and cache topology, and this would harm guest performance. I think we can treat each guest node as a guest package and show the vcpu topology to the guest by giving it proper APIC IDs (consisting of guest SMT_ID/Core_ID/Package_ID). This needs changes to hvmloader's SRAT/MADT APIC ID fields and to xen's cpuid/vlapic emulation.

6) HVM vcpu hot add/remove functionality was added to xen recently. The guest numa support should take this into consideration.

7) I don't see live migration support in your patches. It looks hard for an hvm numa guest to live migrate, as the src/dest hosts could differ a lot in HW configuration.
Thanks, -- Dexuan
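[Editor's note: to illustrate point 5, a minimal sketch of bit-field APIC ID packing with one guest node per package. The helpers and their parameters are hypothetical, and (as Andre notes in his reply below) AMD CPUs do not use such a fixed bit-field layout.]

    #include <stdio.h>
    #include <stdint.h>

    /* smallest number of bits that can hold values 0..n-1 */
    static unsigned int bits_for(unsigned int n)
    {
        unsigned int b = 0;
        while ((1u << b) < n)
            b++;
        return b;
    }

    static uint32_t vcpu_apic_id(unsigned int pkg, unsigned int core,
                                 unsigned int smt,
                                 unsigned int threads_per_core,
                                 unsigned int cores_per_pkg)
    {
        unsigned int smt_bits  = bits_for(threads_per_core);
        unsigned int core_bits = bits_for(cores_per_pkg);

        /* layout: [ Package_ID | Core_ID | SMT_ID ] */
        return ((uint32_t)pkg << (core_bits + smt_bits)) |
               ((uint32_t)core << smt_bits) | smt;
    }

    int main(void)
    {
        /* 2 guest nodes (packages), 2 cores each, 2 threads per core:
         * node 1, core 0, thread 1 -> APIC ID 0b101 = 5 */
        printf("%u\n", vcpu_apic_id(1, 0, 1, 2, 2));
        return 0;
    }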
Cui, Dexuan
2010-Feb-22 10:24 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi Andre, have you returned to the office now? :-) Thanks, -- Dexuan
Andre Przywara
2010-Feb-23 09:53 UTC
Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Cui, Dexuan wrote:
> 1) Besides SRAT, I think we should also build the guest SLIT according to the host SLIT.

That is probably right, though currently low priority. Let's get the basics upstream first.

> 2) I agree we should give the user a way to specify how much memory each guest node should have, namely the "nodemem" parameter in your patch02. However, I can't find where it is assigned a value in your patches; I guess you missed it in image.py.

Omitted for now. I wanted to keep the first patches clean and had a hard time propagating arrays from the config files down to libxc. Is there a good explanation of the different kinds of config file options? I see different classes (like HVM-only) along with some legacy parts that appear quite confusing to me.

> And what if xen can't allocate memory from the specified host node (e.g., not enough free memory on that host node)? Currently xen *silently* tries to allocate memory from other host nodes [...] I think we should add an option to the guest config file: if it's set, guest creation should fail if xen cannot allocate memory from the specified host node.

Exactly the scenario I also had in mind: provide some kind of numa=auto option in the config file to let Xen automatically split up the memory allocation across several nodes if needed. I think we need an upper limit here, or maybe something like:

  numa={force,allow,deny}
  numanodes=2

The numa=allow option would then allocate from up to 2 nodes only if no single node can satisfy the memory request.

> 3) In your patch02:
>   + for (i = 0; i < numanodes; i++)
>   +     numainfo.guest_to_host_node[i] = i % 2;
> As you said in the "[PATCH 5/5]" mail, at present it is "simply round robin until the code for automatic allocation is in place". I think "simply round robin" is not acceptable and we should implement the "automatic allocation".

Right, but this depends on the one part I left out. The first half of it is the xc_nodeload() function; I will try to provide the missing part this week.

> 4) Your patches sort the host nodes using a node load evaluation algorithm, require the user to specify how many guest nodes the guest should see, and distribute the guest vcpus equally across the guest nodes. I don't think the algorithm can be wise enough every time, and it's not flexible. [...]

Another possible extension: I had a draft with "node_cpus=[1,2,1]" to put one vCPU each into the first and third node and two vCPUs into the second node, although I omitted it from this first "draft" release.

> Since guest numa needs vcpu pinning to work as expected, how about my thoughts below?
> a) ask the user to use the "cpus" option to pin each vcpu to a physical cpu (or node);
> b) find out how many physical nodes (host nodes) are involved and use that number as the number of guest nodes;
> c) each guest node corresponds to a host node found in step b); use this info to fill the numainfo.guest_to_host_node[] from 3).

My idea is:
1) use xc_nodeload() to get a list of host nodes with the respective amount of free memory
2) either use the user-provided number of guest nodes or determine the number from the memory availability (=n)
3) select the <n> best nodes from the list (algorithm still to be discussed, but a simple approach is sufficient for now)
4) populate numainfo.guest_to_host_node accordingly
5) pin the vCPUs based on this array (see the sketch after this mail)

This is basically the missing function (TM) I described earlier.

> 5) I think we also need to present the numa guest with a virtual cpu topology, e.g., through the initial APIC ID. [...] I think we can treat each guest node as a guest package and show the vcpu topology to the guest by giving it proper APIC IDs (consisting of guest SMT_ID/Core_ID/Package_ID).

The APIC ID scenario does not work on AMD CPUs, which don't have a bit-field based association between compute units and APIC IDs. For NUMA purposes SRAT should be sufficient, as it overrides APIC-based decisions. But you are right in that it needs more CPUID / ACPI tweaking to get the topology right, although this should be addressed in separate patches: currently(?) it is very cumbersome to inject a specific "cores per socket" number into Xen (by tweaking those ugly CPUID bit masks). For QEMU/KVM I introduced an easy config scheme (smp=8,cores=2,threads=2) to allow this (purely CPUID based). If only I had the time, I would do this for Xen, too.

> 6) HVM vcpu hot add/remove functionality was added to xen recently. The guest numa support should take this into consideration.

Are you volunteering? ;-)

> 7) I don't see live migration support in your patches. It looks hard for an hvm numa guest to live migrate, as the src/dest hosts could differ a lot in HW configuration.

I don't think this is a problem. We need to separate the guest-specific options (like VCPUs to guest nodes, or guest memory to guest nodes mapping) from the host-specific parts (guest nodes to host nodes). I haven't tested it yet, but I assume the config file options specifying the guest-specific parts are already transferred today, so the migrated guest is set up with the proper guest config. The guest node to host node association is then determined dynamically by the new host, depending on its current resources. This can turn out to be sub-optimal: migrating a "4 guest nodes on 4 host nodes" guest to a dual-node host would currently result in a 0-1-0-1 setup, where two guest nodes each are assigned the same host node. I don't see much of a problem here.

Thanks for your thoughts; looking forward to future collaboration.

Regards, Andre.
-- Andre Przywara, AMD-OSRC (Dresden), Tel: x29712
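[Editor's note: a rough sketch of steps 3) to 5) from Andre's list above, assuming a load array from the rating step is available and that there are at least as many host nodes as guest nodes. The selection policy and structures are simplified stand-ins for the still-missing xc_hvm_build.c function.]

    #include <stdio.h>

    #define MAX_NODES 8

    struct numa_info {
        unsigned int guest_to_host_node[MAX_NODES];
    };

    static void assign_guest_nodes(unsigned int nr_host_nodes,
                                   const unsigned int load[], /* from rating */
                                   unsigned int nr_guest_nodes,
                                   unsigned int nr_vcpus,
                                   struct numa_info *ni)
    {
        unsigned int used[MAX_NODES] = { 0 };
        unsigned int vcpus_per_node = (nr_vcpus + nr_guest_nodes - 1) /
                                      nr_guest_nodes;

        for (unsigned int g = 0; g < nr_guest_nodes; g++) {
            /* step 3): least loaded host node not yet taken */
            unsigned int best = nr_host_nodes;
            for (unsigned int h = 0; h < nr_host_nodes; h++)
                if (!used[h] &&
                    (best == nr_host_nodes || load[h] < load[best]))
                    best = h;
            used[best] = 1;
            ni->guest_to_host_node[g] = best;          /* step 4) */
        }

        for (unsigned int v = 0; v < nr_vcpus; v++) {  /* step 5) */
            unsigned int host = ni->guest_to_host_node[v / vcpus_per_node];
            /* here one would build a cpumask of all pCPUs on 'host'
             * and call xc_vcpu_setaffinity() for vCPU v */
            printf("vcpu %u -> host node %u\n", v, host);
        }
    }

    int main(void)
    {
        unsigned int load[MAX_NODES] = { 24, 8, 40, 0 };
        struct numa_info ni;
        assign_guest_nodes(4, load, 2, 4, &ni);  /* picks nodes 3 and 1 */
        return 0;
    }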
Cui, Dexuan
2010-Feb-25 13:14 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Andre Przywara wrote:
>> 1) Besides SRAT, I think we should also build the guest SLIT according to the host SLIT.
> That is probably right, though currently low priority. Let's get the basics upstream first.

I think the goal of guest NUMA is to reflect the hardware configuration properly. Nitin's patch (which exposes host SLIT info to dom0 user space) would help here. I think adding hvm guest SLIT should not be complex, and we can do that together once Nitin's patch is in (after Xen 4.0.0 is released).

> Omitted for now. I wanted to keep the first patches clean and had a hard time propagating arrays from the config files down to libxc. [...]

I also feel a lot of effort is needed here to cleanly pass the necessary info from the guest config file to libxc (and to hvmloader and possibly to the hypervisor). :-)

If the "nodemem" option is not specified by the user, your patches look like they distribute the guest memory equally across the guest nodes. I think this is not good -- we should require the user to explicitly specify how the guest memory is to be distributed. For example: assume there are 2 host nodes in a platform, and the user knows hNode0 has 3G of memory available and hNode1 has 8G (easily found out via "xm info"); now the user needs to create a guest with 10G of memory. If we enable guest numa and distribute the memory equally, the guest thinks there are 2 nodes, with the first 5G on gNode0 and the second 5G on gNode1, and we map gNodes 1:1 to hNodes -- but actually 40% of gNode0's memory is not on hNode0! In this case, not enabling guest numa may achieve better guest performance. I mean: equally distributing guest memory across guest nodes makes guest performance very unpredictable for the user. The policy could therefore be: only enable guest numa iff "nodemem" is specified.

> Exactly the scenario I also had in mind [...] I think we need an upper limit here, or maybe something like:
>   numa={force,allow,deny}
>   numanodes=2

I think the policy could be:
1) if no numa config is specified by the user, we should try our best to make guest creation succeed, even if the guest ends up with bad performance;
2) if "nodemem" is specified, we should try our best to satisfy it; when we can't: if "numa=force" is specified, we fail the guest creation; else, we again try our best to make guest creation succeed, even at the cost of guest performance.

> The numa=allow option would then allocate from up to 2 nodes only if no single node can satisfy the memory request.

I don't think it's good to require the user to specify the number of guest nodes; it's not straightforward for a user at all. In my mind the typical scenario is: one day a user needs to create a "powerful" guest with 32 vcpus and 64G of memory. By running "xm info" the user learns there are 3 hNodes (this is just a casually faked case of mine :-):

  hNode0: 8 logical processors, 20G memory available;
  hNode1: 24 logical processors, 40G memory available;
  hNode2: 8 logical processors, 40G memory available.

After thinking about it for a few seconds, the user decides to allocate 4/24/4 vcpus and 10G/40G/14G of memory from hNode0/1/2, respectively; or the user may decide to allocate 24/8 vcpus and 40G/24G of memory from hNode1/2, respectively. I mean: we should be able to deduce the number of guest nodes from the user's explicit configuration (see the sketch after this mail). Without that, guest performance becomes unpredictable if we simply require the user to supply "numanodes" and then try to figure out the "best" vcpu/memory distribution for the user (I think that is difficult and not flexible at all). Ian Pratt also seems to lean towards this idea in his reply to the other thread "Host Numa informtion in dom0".

> Right, but this depends on the one part I left out. The first half of it is the xc_nodeload() function; I will try to provide the missing part this week.

As I replied above, I think it's better to ask the user for an explicit configuration. It's difficult to build an algorithm that is always wise enough to figure out the best solution, and the user would lose flexibility.

> Another possible extension: I had a draft with "node_cpus=[1,2,1]" [...]

As I replied above, the "nodemem" and "cpus" (vcpu affinity) information should be enough, and "node_cpus" would be redundant.

> My idea is:
> 1) use xc_nodeload() to get a list of host nodes with the respective amount of free memory [...]
> 5) pin the vCPUs based on this array
> This is basically the missing function (TM) I described earlier.

Please see my reply above.

> The APIC ID scenario does not work on AMD CPUs, which don't have a bit-field based association between compute units and APIC IDs. For NUMA purposes SRAT should be sufficient, as it overrides APIC-based decisions.

Sorry, I'm not familiar with APIC IDs on AMD CPUs. My thought is: assuming an hNode corresponds to a host package, a package has some cores, and a core has 2 threads, if we could expose this info (and the related host cache topology) to the guest, the guest OS could intentionally schedule "related" processes onto the threads of the same core and thereby achieve better guest performance.

> But you are right in that it needs more CPUID / ACPI tweaking to get the topology right, although this should be addressed in separate patches: currently(?) it is very cumbersome to inject a specific "cores per socket" number into Xen (by tweaking those ugly CPUID bit masks).

I looked into the current Xen code and agree it's not easy to make the injection clean; however, it may be worth the effort if it can improve guest performance to a notable degree. I'm trying to obtain some data now.

> Are you volunteering? ;-)

Yes, I'm looking into this.

> I don't think this is a problem. We need to separate the guest-specific options (like VCPUs to guest nodes, or guest memory to guest nodes mapping) from the host-specific parts (guest nodes to host nodes). [...] I don't see much of a problem here.

A concern of mine: after the migration, does the user expect guest performance to change a lot? E.g., assume two identical hosts, each running many guests. After migrating a numa guest from host A to host B, the underlying memory distribution may differ (the amount of memory available on the individual nodes differs between A and B: on A, all of gNode0's memory may fit on hNode0, while on B, gNode0's memory may end up spread over hNode0 and hNode1), and the guest performance would degrade. Another thing: if we change the current "apic_id = vcpu_id * 2" mapping, we'll have a compatibility issue.

Thanks, -- Dexuan
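[Editor's note: purely for illustration, here is what Dexuan's explicit-configuration proposal could look like for the 4/24/4-vcpu, 10G/40G/14G example above. The nodemem option and the cpu ranges of each host node are assumptions, not implemented syntax; xm config files are evaluated as Python, so the per-vcpu "cpus" list can be built with list arithmetic.]

    vcpus   = 32
    memory  = 65536
    # hypothetical: memory per guest node in MB (10G/40G/14G)
    nodemem = [ 10240, 40960, 14336 ]
    # one pinning entry per vcpu: 4 vcpus on hNode0's cpus, 24 on
    # hNode1's, 4 on hNode2's (the cpu ranges per node are assumed)
    cpus    = [ "0-7" ] * 4 + [ "8-31" ] * 24 + [ "32-39" ] * 4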
Cui, Dexuan
2010-Mar-31 06:23 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Hi Andre, will you re-post your patches?

Now I think for the first implementation we can keep things simple: e.g., we specify how many guest nodes the hvm guest will see (the "guestnodes" option in your patch -- I think "numa_nodes", or "nodes", may be a better name), and we distribute guest memory and vcpus uniformly among the guest nodes. And we should add one more option, "uniform_nodes": this boolean option's default value can be True, meaning that if we can't construct uniform nodes for the guest (e.g., not enough memory as expected can be allocated to the guest on the related host node), guest creation should fail. This option is useful for users who want predictable guest performance.

Thanks, -- Dexuan
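[Editor's note: as a concrete sketch of this simpler interface, a guest config could read as follows; both options are proposals in this thread, not existing xm syntax.]

    vcpus         = 8
    memory        = 16384
    guestnodes    = 2      # guest sees 2 nodes: 4 vcpus and 8G each
    uniform_nodes = True   # fail creation rather than build non-uniform nodes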
Andre Przywara
2010-Apr-04 21:41 UTC
Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Cui, Dexuan wrote:
> Hi Andre, will you re-post your patches?

Yes, I will in the next days. I plan to add the missing automatic assignment patch before posting.

> Now I think for the first implementation we can keep things simple: e.g., we specify how many guest nodes the hvm guest will see (the "guestnodes" option in your patch -- I think "numa_nodes", or "nodes", may be a better name), and we distribute guest memory and vcpus uniformly among the guest nodes.

I agree, making things simple in the first step was also my intention. We have enough time to improve it later once we have more experience with it. To be honest, my first try also used "nodes" and later "numa_nodes" to specify the number, but I learned that it confuses users who don't see the difference between host and guest NUMA functionality. So I wanted to make sure it is clear that this number is from the guest's point of view.

> And we should add one more option, "uniform_nodes": this boolean option's default value can be True, meaning that if we can't construct uniform nodes for the guest (e.g., not enough memory as expected can be allocated to the guest on the related host node), guest creation should fail. This option is useful for users who want predictable guest performance.

I agree that we need to avoid ignoring the user's intent, although I'd prefer to have the word "strict" somewhere in this name. As I wrote in one of my earlier mails, I'd opt for a single option describing the policy; the "strict" meaning could be integrated there:

  numa_policy="strict|uniform|automatic|none|single|..."

Regards, Andre.

-- Andre Przywara, AMD-Operating System Research Center (OSRC), Dresden, Germany, Tel: +49 351 488-3567-12
Cui, Dexuan
2010-Apr-07 07:42 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Andre Przywara wrote:
>> Hi Andre, will you re-post your patches?
> Yes, I will in the next days. I plan to add the missing automatic assignment patch before posting.

Glad to know this. BTW: to support PV NUMA, this Monday Dulloor posted some patches that also change libxc and the hypervisor.

Hi Andre, Dulloor, I believe we should coordinate to share code and avoid duplicated effort:

e.g., Dulloor's linux-01-sync-interface.patch is similar to Andre's old patch http://lists.xensource.com/archives/html/xen-devel/2008-07/msg00254.html, though the former is for the PV kernel and the latter is for libxc and the hypervisor. :-)
e.g., Dulloor's xen-02-exact-node-request.patch has implemented the MEMF_exact_node flag, which I had intended to do. :-)
e.g., Dulloor's xen-03-guest-numa-interface.patch implements a hypercall to export host numa info -- actually, Nitin has sent out a patch to export more useful numa info: http://old.nabble.com/Host-Numa-informtion-in-dom0-td27379527.html and I suppose Nitin will re-send it soon.
e.g., Dulloor's xen-04-node-mem-allocation.patch's xc_select_best_fit_nodes() is similar to Andre's xc_getnodeload(): http://lists.xensource.com/archives/html/xen-devel/2010-02/msg00284.html.
e.g., in Dulloor's xen-05-basic-cpumap-utils.patch and xen-07-tools-arch-setup.patch, I think some parts could be shared by the pv/hvm numa implementations if we make some necessary changes to them.

>> Now I think for the first implementation we can keep things simple [...]
> I agree, making things simple in the first step was also my intention. [...] So I wanted to make sure it is clear that this number is from the guest's point of view.
>
>> And we should add one more option, "uniform_nodes" [...] This option is useful for users who want predictable guest performance.
> I agree that we need to avoid ignoring the user's intent, although I'd prefer to have the word "strict" somewhere in this name. As I wrote in one of my earlier mails, I'd opt for a single option describing the policy; the "strict" meaning could be integrated there:
>   numa_policy="strict|uniform|automatic|none|single|..."

Hi Andre, I think this looks too complex for the first simple implementation, and a real user would very likely be bewildered. :-) I think ideally we can have 2 options:

  guest_nodes=n
  uniform_nodes=True|False (the default is True)

Thanks, -- Dexuan
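[Editor's note: the MEMF_exact_node behaviour mentioned above boils down to suppressing the allocator's silent fallback to other nodes. A simplified, self-contained sketch of those semantics follows; this is not Xen's real page allocator, and the stub, flag value, and function shapes are invented.]

    #include <stdio.h>
    #include <stdlib.h>

    #define MEMF_exact_node (1u << 0)   /* illustrative flag value */

    /* stub allocator: pretend only node 1 has free pages */
    static void *alloc_pages_on_node(unsigned int node, unsigned int order)
    {
        (void)order;
        return node == 1 ? malloc(4096) : NULL;
    }

    static void *alloc_domheap_sketch(unsigned int node, unsigned int order,
                                      unsigned int nr_nodes,
                                      unsigned int memflags)
    {
        void *pg = alloc_pages_on_node(node, order);

        if (pg != NULL || (memflags & MEMF_exact_node))
            return pg;  /* exact-node request: fail rather than fall back */

        for (unsigned int n = 0; n < nr_nodes; n++)  /* silent fallback */
            if (n != node && (pg = alloc_pages_on_node(n, order)) != NULL)
                return pg;

        return NULL;
    }

    int main(void)
    {
        /* exact request on an empty node fails; plain request falls back */
        printf("exact:    %p\n", alloc_domheap_sketch(0, 0, 2, MEMF_exact_node));
        printf("fallback: %p\n", alloc_domheap_sketch(0, 0, 2, 0));
        return 0;
    }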
Keir Fraser
2010-Apr-07 07:52 UTC
Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
On 07/04/2010 08:42, "Cui, Dexuan" <dexuan.cui@intel.com> wrote:
> Hi Andre, Dulloor,
> I believe we should coordinate to share code and avoid duplicated effort.

Yes, I'd like to see a coordinated patch series for HVM and PV, or at least code sharing and cross-Acks if they remain two separate patch series.

-- Keir
Andre Przywara
2010-Apr-07 09:03 UTC
Re: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Cui, Dexuan wrote:
>>> Hi Andre, will you re-post your patches?
>> Yes, I will in the next days. I plan to add the missing automatic assignment patch before posting.
> Glad to know this.
> BTW: to support PV NUMA, this Monday Dulloor posted some patches that also change libxc and the hypervisor.

Yes, I saw them. I am about to look at them more thoroughly and will get back to you on this later.

<skip>

> I think this looks too complex for the first simple implementation, and a real user would very likely be bewildered. :-) I think ideally we can have 2 options:
>   guest_nodes=n
>   uniform_nodes=True|False (the default is True)

I agree on guest_nodes, but I want to avoid a bunch of "single bit" options that we have to carry along later. Let's introduce a numa_policy option and only implement the words we need for now, e.g. "strict" (equivalent to "uniform_nodes=True") and "automatic" (aka "uniform_nodes=False"). The list I gave above was just a quick example of _possible_ words; it was neither exhaustive nor non-redundant.

Regards, Andre.

-- Andre Przywara, AMD-OSRC (Dresden), Tel: x29712
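[Editor's note: a minimal sketch of how such a numa_policy option could map onto the earlier boolean, implementing only the two words Andre names; all identifiers are hypothetical.]

    #include <string.h>

    enum numa_policy { NUMA_POLICY_STRICT, NUMA_POLICY_AUTOMATIC };

    /* "strict"    == uniform_nodes=True  (fail if nodes can't be uniform)
     * "automatic" == uniform_nodes=False (best effort, creation succeeds) */
    static int parse_numa_policy(const char *s, enum numa_policy *out)
    {
        if (strcmp(s, "strict") == 0)
            *out = NUMA_POLICY_STRICT;
        else if (strcmp(s, "automatic") == 0)
            *out = NUMA_POLICY_AUTOMATIC;
        else
            return -1;   /* unknown word: reject, keep namespace open */
        return 0;
    }

    int main(void)
    {
        enum numa_policy p;
        return parse_numa_policy("strict", &p);  /* returns 0 */
    }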
Cui, Dexuan
2010-Apr-07 09:28 UTC
RE: [Xen-devel] [PATCH 0/5] [POST-4.0]: RFC: HVM NUMA guest support
Andre Przywara wrote:
> I agree on guest_nodes, but I want to avoid a bunch of "single bit" options that we have to carry along later. Let's introduce a numa_policy option and only implement the words we need for now, e.g. "strict" (equivalent to "uniform_nodes=True") and "automatic" (aka "uniform_nodes=False").

I think the words "strict" and "automatic" are too obscure for the user. We'd better use unambiguous names.

Thanks, -- Dexuan