Hi everyone,

With automatic placement finally landing in xen-unstable, I started thinking about what I could work on next, still in the field of improving Xen's NUMA support. Well, it turned out that running out of things to do is not an option! :-O

In fact, I can think of quite a few open issues in that area, which I'm just braindumping here. If anyone has thoughts or ideas or feedback or whatever, I'd be happy to serve as a collector of them. I've already created a Wiki page to help with the tracking. You can see it here (for now it basically replicates this e-mail):

  http://wiki.xen.org/wiki/Xen_NUMA_Roadmap

I'm putting a [D] (standing for Dario) near the points I've started working on or looking at, and again, I'd be happy to try tracking this too, i.e., keeping the list of "who-is-doing-what" updated, in order to ease collaboration.

So, let's cut the talking:

 - Automatic placement at guest creation time. Basics are there and
   will be shipping with 4.2. However, a lot of other things are
   missing and/or can be improved, for instance:
   [D] * automated verification and testing of the placement;
       * benchmarks and improvements of the placement heuristic;
   [D] * choosing/building up some measure of node load (more accurate
         than just counting vcpus) on which to rely during placement
         (see the node-scoring sketch after this message);
       * consider IONUMA during placement;
       * automatic placement of Dom0, if possible (my current series is
         only affecting DomU);
       * having internal Xen data structures honour the placement (e.g.,
         I've been told that right now vcpu stacks are always allocated
         on node 0... Andrew?).

 [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
       just have them _prefer_ running on the nodes where their memory
       is.

 [D] - Dynamic memory migration between different nodes of the host, as
       the counterpart of the NUMA-aware scheduler.

 - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
   guest ends up on more than one node, make sure it knows it's
   running on a NUMA platform (smaller than the actual host, but
   still NUMA). This interacts with some of the above points:
     * consider this during automatic placement for
       resuming/migrating domains (if they have a virtual topology,
       better not to change it);
     * consider this during memory migration (it can change the
       actual topology; should we update it on-line or disable memory
       migration?).

 - NUMA, ballooning and memory sharing. In some more detail:
     * page sharing on NUMA boxes: it's probably sane to make it
       possible to disable sharing pages across nodes;
     * ballooning and its interaction with placement (races, amount of
       memory needed and reported being different at different times,
       etc.).

 - Inter-VM dependencies and communication issues. If a workload is
   made up of more than just one VM and they all share the same (NUMA)
   host, it might be best to have them share the nodes as much as
   possible, or perhaps do just the opposite, depending on the
   specific characteristics of the workload itself, and this might be
   considered during placement, memory migration and perhaps
   scheduling.

 - Benchmarking and performance evaluation in general. Meaning both
   agreeing on a (set of) relevant workload(s) and on how to extract
   meaningful performance data from there (and maybe how to do that
   automatically?).

So, what do you think?
Thanks and Regards,
Dario

-- 
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
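As a concrete illustration of the "measure of node load" point above, the following is a minimal, self-contained sketch of a node-scoring function that combines free memory with a vcpu-per-pcpu ratio. It is purely illustrative and is not the libxl placement heuristic; the data layout and the weighting are assumptions made up for the example.

/* Illustrative only: a toy node-scoring function for NUMA placement,
 * NOT the actual libxl heuristic.  Per-node data and weights are made up. */
#include <stdio.h>

struct node_info {
    unsigned long free_kb;   /* free memory on the node, in KiB */
    unsigned int  nr_vcpus;  /* vcpus of domains already placed there */
    unsigned int  nr_pcpus;  /* physical cpus belonging to the node */
};

/* Lower score == better candidate: prefer nodes with more free memory
 * and a lower vcpu-per-pcpu ratio. */
static double node_score(const struct node_info *n, unsigned long dom_kb)
{
    if (n->free_kb < dom_kb)
        return 1e9;                        /* domain does not fit at all */
    double load = (double)n->nr_vcpus / n->nr_pcpus;
    double mem_pressure = (double)dom_kb / n->free_kb;
    return load + mem_pressure;            /* equal weighting, arbitrarily */
}

int main(void)
{
    struct node_info nodes[] = {
        { .free_kb = 8u << 20, .nr_vcpus = 6, .nr_pcpus = 8 },  /* node 0 */
        { .free_kb = 2u << 20, .nr_vcpus = 1, .nr_pcpus = 8 },  /* node 1 */
    };
    unsigned long dom_kb = 1u << 20;       /* a 1 GiB guest */
    int best = 0;

    for (int i = 1; i < 2; i++)
        if (node_score(&nodes[i], dom_kb) < node_score(&nodes[best], dom_kb))
            best = i;
    printf("best node: %d\n", best);
    return 0;
}

A real heuristic would of course also have to deal with multi-node candidates, ties and hard constraints (cpupools, pinning), which this sketch deliberately ignores.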
On Wed, 2012-08-01 at 18:16 +0200, Dario Faggioli wrote:
> Hi everyone,
>
Quite a bad subject... I put it there just as a placeholder and then forgot to change it into something sensible. :-( Sorry for that. I hope the content can still get some attention. :-P

Thanks again and Regards,
Dario
On 01/08/12 17:16, Dario Faggioli wrote:
> Hi everyone,
>
> ...
>
> - Automatic placement at guest creation time. Basics are there and
>   will be shipping with 4.2. However, a lot of other things are
>   missing and/or can be improved, for instance:
>   [D] * automated verification and testing of the placement;
>       * benchmarks and improvements of the placement heuristic;
>   [D] * choosing/building up some measure of node load (more accurate
>         than just counting vcpus) on which to rely during placement;
>       * consider IONUMA during placement;
>       * automatic placement of Dom0, if possible (my current series is
>         only affecting DomU);
>       * having internal Xen data structures honour the placement (e.g.,
>         I've been told that right now vcpu stacks are always allocated
>         on node 0... Andrew?).
>
> ...

- Xen NUMA internals.
Placing items such as the per-cpu stacks and data area on the local NUMA node, rather than unconditionally on node 0 at the moment. As part of this, there will be changes to alloc_{dom,xen}heap_page() to allow specification of which node(s) to allocate memory from.

~Andrew

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On 1 Aug 2012, at 17:16, Dario Faggioli <raistlin@linux.it> wrote:
> - Inter-VM dependencies and communication issues. If a workload is
>   made up of more than just one VM and they all share the same (NUMA)
>   host, it might be best to have them share the nodes as much as
>   possible, or perhaps do just the opposite, depending on the
>   specific characteristics of the workload itself, and this might be
>   considered during placement, memory migration and perhaps
>   scheduling.
>
> - Benchmarking and performance evaluation in general. Meaning both
>   agreeing on a (set of) relevant workload(s) and on how to extract
>   meaningful performance data from there (and maybe how to do that
>   automatically?).

I haven't tried out the latest Xen NUMA features yet, but we've been keeping track of the IPC benchmarks as we get newer machines here:

  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html

The newer chipsets (Sandy Bridge and AMD Valencia) both have quite different inter-core/socket/MPM performance characteristics from their respective previous generations; e.g.

  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmpfCBrYh.html
  http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/details/tmppI61nX.html

Happy to share the raw data if you have cycles to figure out the best way to auto-place multiple VMs so they are near each other from a memory latency perspective. We haven't run many macro-benchmarks though, so in practice it might not matter; it would be nice to settle on a good set of benchmarks to determine that for sure.

-anil
On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
> On 01/08/12 17:16, Dario Faggioli wrote:
>
> ...
>
> - Xen NUMA internals. Placing items such as the per-cpu stacks and
>   data area on the local NUMA node, rather than unconditionally on node
>   0 at the moment. As part of this, there will be changes to
>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>   allocate memory from.

As you see, I already tried to consider that (as you told me about it a couple of weeks ago :-) ). I'll add your wording of it (much better than mine) to the wiki... I understand you're working on this, aren't you? Can I put that down to you?

Thanks and Regards,
Dario
On 01/08/12 17:47, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:30 +0100, Andrew Cooper wrote:
>> ...
>> - Xen NUMA internals. Placing items such as the per-cpu stacks and
>>   data area on the local NUMA node, rather than unconditionally on node
>>   0 at the moment. As part of this, there will be changes to
>>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>>   allocate memory from.
>
> As you see, I already tried to consider that (as you told me about it
> a couple of weeks ago :-) ). I'll add your wording of it (much better
> than mine) to the wiki... I understand you're working on this, aren't
> you? Can I put that down to you?
>
> Thanks and Regards,
> Dario

Wow - I completely managed to miss that while reading. Someone will be working on it for XS.next, and that someone will probably be me - put me down for it.

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
> I haven't tried out the latest Xen NUMA features yet, but we've been
> keeping track of the IPC benchmarks as we get newer machines here:
>
>   http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
>
Wow... That's really cool. I'll definitely take a deep look at all these data! I'm also adding the link to the wiki, if you're fine with that...

> Happy to share the raw data if you have cycles to figure out the best
> way to auto-place multiple VMs so they are near each other from a memory
> latency perspective.
>
I don't have anything precise in mind yet, but we need to think about this.

> We haven't run many macro-benchmarks though, so in practice it might
> not matter; it would be nice to settle on a good set of benchmarks to
> determine that for sure.
>
Yes, that's what we need. I'm open and available to try to figure this out anytime... I seem to recall you're going to be in San Diego for XenSummit, am I right? If yes, we can discuss this more there.

Thanks and Regards,
Dario
On 01/08/12 17:58, Dario Faggioli wrote:
> On Wed, 2012-08-01 at 17:32 +0100, Anil Madhavapeddy wrote:
>> ...
>> I haven't tried out the latest Xen NUMA features yet, but we've been
>> keeping track of the IPC benchmarks as we get newer machines here:
>>
>>   http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/results.html
>>
> Wow... That's really cool. I'll definitely take a deep look at all these
> data! I'm also adding the link to the wiki, if you're fine with that...

No problem with adding a link, as this is public data :) If possible, it'd be splendid to put a note next to this link encouraging people to submit their own results -- doing so is very simple, and helps us extend the database. Instructions are at http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short link, http://fable.io).

>> Happy to share the raw data if you have cycles to figure out the best
>> way to auto-place multiple VMs so they are near each other from a memory
>> latency perspective.
>>
> I don't have anything precise in mind yet, but we need to think about
> this.

While there has been plenty of work on optimizing co-location of different kinds of workloads, there's relatively little work (that I am aware of) on VM scheduling in this environment. One (sadly somewhat lacking) paper at HotCloud this year [1] looked at NUMA-aware VM migration to balance memory accesses. Of greater interest is possibly the Google ISCA paper on the detrimental effect of sharing micro-architectural resources between different kinds of workloads, although it is not explicitly focused on NUMA, and the metrics are defined with regards to specific classes of latency-sensitive jobs [2].

One interesting thing to look at (that we haven't looked at yet) is what memory allocators do about NUMA these days; there is an AMD whitepaper from 2009 discussing the performance benefits of a NUMA-aware version of tcmalloc [3], but I have found it hard to reproduce their results on modern hardware. Of course, being virtualized may complicate matters here, since the memory allocator can no longer freely pick and choose where to allocate from. Scheduling, notably, is key here, since the CPU a process is scheduled on may determine where its memory is allocated -- frequent migrations are likely to be bad for performance due to remote memory accesses, although we have been unable to quantify a significant difference on non-synthetic macrobenchmarks; that said, we did not try very hard so far.
Cheers,
Malte

[1] - Ahn et al., "Dynamic Virtual Machine Scheduling in Clouds for Architectural Shared Resources", in Proceedings of HotCloud 2012, https://www.usenix.org/conference/hotcloud12/dynamic-virtual-machine-scheduling-clouds-architectural-shared-resources
[2] - Tang et al., "The impact of memory subsystem resource sharing on datacenter applications", in Proceedings of ISCA 2011, http://dl.acm.org/citation.cfm?id=2000099
[3] - http://developer.amd.com/Assets/NUMA_aware_heap_memory_manager_article_final.pdf
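To make the allocator discussion above a little more concrete, here is a small sketch using libnuma's node-targeted allocation. It is a generic Linux/libnuma illustration of allocating (and faulting in) memory on a chosen node; it is not taken from tcmalloc or from the AMD whitepaper, and the choice of node is arbitrary.

/* Minimal libnuma sketch: allocate a buffer on a specific NUMA node and
 * touch it so the pages are actually faulted in there.
 * Build with: gcc -o onnode onnode.c -lnuma   (illustrative only) */
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available on this system\n");
        return 1;
    }

    int node = numa_max_node();          /* pick the last node, arbitrarily */
    size_t len = 64 << 20;               /* 64 MiB */

    void *buf = numa_alloc_onnode(len, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    memset(buf, 0, len);                 /* fault the pages in on 'node' */
    printf("allocated %zu bytes on node %d\n", len, node);

    numa_free(buf, len);
    return 0;
}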
Dario Faggioli wrote on 2012-08-02:
> Hi everyone,
>
> ...
>
> - Automatic placement at guest creation time. Basics are there and
>   will be shipping with 4.2. However, a lot of other things are
>   missing and/or can be improved, for instance:
>   [D] * automated verification and testing of the placement;
>       * benchmarks and improvements of the placement heuristic;
>   [D] * choosing/building up some measure of node load (more accurate
>         than just counting vcpus) on which to rely during placement;
>       * consider IONUMA during placement;

We should consider two things (a rough Dom0-side sketch follows this message):

1. Dom0 IONUMA: devices used by Dom0 should get their DMA buffers from the node where they reside. Currently, Dom0 allocates DMA buffers without providing the node info to the hypercall.

2. Guest IONUMA: when a guest boots up with a pass-through device, we need to allocate the memory from the node where the device resides for further DMA buffer allocation, and let the guest know the IONUMA topology. This relies on guest NUMA support. This topic was mentioned at Xen Summit 2011:
http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf

> ...
Best regards,
Yang
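As a rough illustration of the Dom0 side of point 1 above, the fragment below shows how a Linux driver can discover the NUMA node its device sits on and ask for pages there. It deliberately stops short of the hypercall changes Yang mentions; dev_to_node() and alloc_pages_node() are existing Linux interfaces, while the surrounding function is invented for the example and is not an actual Xen or Linux patch.

/* Sketch only: node-aware buffer allocation in a (hypothetical) Dom0 driver.
 * The remaining work, i.e. passing the node down to Xen's populate_physmap
 * path when new machine pages are needed, is not shown here. */
#include <linux/device.h>
#include <linux/gfp.h>
#include <linux/numa.h>
#include <linux/topology.h>

static struct page *example_alloc_dma_buffer(struct device *dev,
                                             unsigned int order)
{
        int node = dev_to_node(dev);    /* NUMA node the device sits on */

        if (node == NUMA_NO_NODE)
                node = numa_node_id();  /* fall back to the current node */

        /* Ask for pages on that node; the allocator may still fall back
         * to other nodes if that one is exhausted. */
        return alloc_pages_node(node, GFP_KERNEL | __GFP_DMA32, order);
}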
>>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> - Xen NUMA internals. Placing items such as the per-cpu stacks and data
>   area on the local NUMA node, rather than unconditionally on node 0 at
>   the moment. As part of this, there will be changes to
>   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
>   allocate memory from.

Those interfaces already support flags to be passed, including a node ID. It just needs to be made use of in more places.

Jan
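For readers less familiar with those interfaces, something along the following lines is what is being referred to. MEMF_node() and alloc_xenheap_pages() are real Xen interfaces, but this is only a sketch of how a node-targeted allocation could look; the percpu_area_init() wrapper and the node lookup shown here are invented for the example and are not the actual per-cpu stack code.

/* Sketch: allocating a xenheap page on a specific node via the existing
 * memflags mechanism.  Illustrative only, not a quote of Xen source. */
#include <xen/mm.h>
#include <xen/numa.h>

static void *percpu_area_init(unsigned int cpu)
{
    unsigned int node = cpu_to_node(cpu);        /* node owning this pcpu */

    /* Order-0 xenheap allocation, preferably from 'node' rather than
     * whatever node the allocator would otherwise favour (often node 0). */
    void *p = alloc_xenheap_pages(0, MEMF_node(node));

    if ( p == NULL )
        p = alloc_xenheap_pages(0, 0);           /* fall back to any node */

    return p;
}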
>>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>   guest ends up on more than one node, make sure it knows it's
>   running on a NUMA platform (smaller than the actual host, but
>   still NUMA). This interacts with some of the above points:

The question is whether this is really useful beyond the (I would suppose) relatively small set of cases where migration isn't needed.

>     * consider this during automatic placement for
>       resuming/migrating domains (if they have a virtual topology,
>       better not to change it);
>     * consider this during memory migration (it can change the
>       actual topology; should we update it on-line or disable memory
>       migration?)

The question is whether trading functionality for performance is an acceptable choice.

Jan
On Thu, 2012-08-02 at 10:40 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:30, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > - Xen NUMA internals. Placing items such as the per-cpu stacks and data
> >   area on the local NUMA node, rather than unconditionally on node 0 at
> >   the moment. As part of this, there will be changes to
> >   alloc_{dom,xen}heap_page() to allow specification of which node(s) to
> >   allocate memory from.
>
> Those interfaces already support flags to be passed, including a
> node ID. It just needs to be made use of in more places.
>
Yes, I also remember it being already node_affinity conscious, and think it's more a matter of how it is called. I'll update the wiki accordingly (it doesn't need to contain these sort of details anyway).

Thanks and Regards,
Dario
On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >   guest ends up on more than one node, make sure it knows it's
> >   running on a NUMA platform (smaller than the actual host, but
> >   still NUMA). This interacts with some of the above points:
>
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
>
Mmm... Not sure I'm getting what you're saying here, sorry. Are you suggesting that exposing a virtual topology is not a good idea as it poses constraints on/prevents live migration?

If yes, well, I mostly agree that this is a huge issue, and that's why I think we need some bright idea on how to deal with it. I mean, it's easy to make it optional and let it automatically disable migration, giving users the choice of what they prefer, but I think this is more dodging the problem than dealing with it! :-P

> > * consider this during automatic placement for
> >   resuming/migrating domains (if they have a virtual topology,
> >   better not to change it);
> > * consider this during memory migration (it can change the
> >   actual topology; should we update it on-line or disable memory
> >   migration?)
>
> The question is whether trading functionality for performance
> is an acceptable choice.
>
Indeed. Again, I think it is possible to implement things flexibly enough, but then we need to come up with a sane default, so we're not allowed to avoid discussing and deciding on this.

One can argue that it is an issue only for big-enough guests (and/or nearly overcommitted hosts) that don't fit in only one node (as, if they do, there is no virtual topology to export), but I'm not sure we can neglect them on this basis.

Thanks for the feedback,
Dario
>>> On 02.08.12 at 15:34, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> ...
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>>
> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> suggesting that exposing a virtual topology is not a good idea as it
> poses constraints on/prevents live migration?

Yes.

> If yes, well, I mostly agree that this is a huge issue, and that's why
> I think we need some bright idea on how to deal with it. I mean, it's
> easy to make it optional and let it automatically disable migration,
> giving users the choice of what they prefer, but I think this is more
> dodging the problem than dealing with it! :-P

Indeed.

>> The question is whether trading functionality for performance
>> is an acceptable choice.
>>
> Indeed. Again, I think it is possible to implement things flexibly
> enough, but then we need to come up with a sane default, so we're not
> allowed to avoid discussing and deciding on this.
>
> One can argue that it is an issue only for big-enough guests (and/or
> nearly overcommitted hosts) that don't fit in only one node (as, if they
> do, there is no virtual topology to export), but I'm not sure we can
> neglect them on this basis.

We certainly can't, the more that the "big enough" case may not be that infrequent going forward.

Jan
On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
> On Thu, 2012-08-02 at 10:43 +0100, Jan Beulich wrote:
>> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
>> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
>> >   guest ends up on more than one node, make sure it knows it's
>> >   running on a NUMA platform (smaller than the actual host, but
>> >   still NUMA). This interacts with some of the above points:
>>
>> The question is whether this is really useful beyond the (I would
>> suppose) relatively small set of cases where migration isn't
>> needed.
>>
> ...
>> > * consider this during automatic placement for
>> >   resuming/migrating domains (if they have a virtual topology,
>> >   better not to change it);
>> > * consider this during memory migration (it can change the
>> >   actual topology; should we update it on-line or disable memory
>> >   migration?)

I think we could use cpu hot-plug to change the "virtual topology" of VMs, couldn't we? We could probably even do that on a running guest if we really needed to.

-George
>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>> ...
>
> I think we could use cpu hot-plug to change the "virtual topology" of
> VMs, couldn't we? We could probably even do that on a running guest
> if we really needed to.

Hmm, not sure - using hotplug behind the back of the guest might be possible, but you'd first need to hot-unplug the vCPU. That's something that I don't think you can do on HVM guests (and for PV guests, guest-visible NUMA support makes even less sense than for HVM ones).

Jan
On 08/03/2012 11:23 AM, Jan Beulich wrote:
>>>> On 02.08.12 at 18:36, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
>> On Thu, Aug 2, 2012 at 2:34 PM, Dario Faggioli <raistlin@linux.it> wrote:
>>> ...
>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
>>> suggesting that exposing a virtual topology is not a good idea as it
>>> poses constraints on/prevents live migration?

Honestly, what would be the problems with migration? NUMA awareness is actually a software optimization, so we will not really break something if the advertised topology isn't the real one. This is especially true if we lower the number of NUMA nodes. Say the guest starts with two nodes and then gets migrated to a machine where it can happily live in one node. There would be some extra effort by the guest OS to obey the virtual NUMA topology, but if there isn't actually a NUMA penalty anymore this shouldn't really hurt, right?

Even if we had to go to a machine where we have more nodes for a certain guest than before, this is actually what we have today: guest NUMA unawareness. I am not sure if this is really a migration showstopper, and certainly not a NUMA guest showstopper. But we could make it a config file option, so we leave this decision to the admin. I have talked to people with huge guests, and they keep asking me about this feature.

>>> ...
>> I think we could use cpu hot-plug to change the "virtual topology" of
>> VMs, couldn't we? We could probably even do that on a running guest
>> if we really needed to.
>
> Hmm, not sure - using hotplug behind the back of the guest might
> be possible, but you'd first need to hot-unplug the vCPU. That's
> something that I don't think you can do on HVM guests (and for
> PV guests, guest-visible NUMA support makes even less sense
> than for HVM ones).

I don't think that hotplug would really work. I have checked this some time ago; at least the Linux NUMA code cannot really be fooled by this. The SRAT table is firmware-defined and static by nature, so there is no code in Linux to change the NUMA topology at runtime. This is especially true for the memory layout. But as said above, I don't really buy this as an argument against guest NUMA. At least provide it as an option to people who know what they do.

Regards,
Andre.
-- 
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> Hi everyone,
>
> With automatic placement finally landing in xen-unstable, I started
> thinking about what I could work on next, still in the field of
> improving Xen's NUMA support. Well, it turned out that running out of
> things to do is not an option! :-O
>
> In fact, I can think of quite a few open issues in that area, which I'm
> just braindumping here.
> ...
>
> * automatic placement of Dom0, if possible (my current series is
>   only affecting DomU)

I think Dom0 NUMA awareness should be one of the top priorities. If I boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which actually has memory from all 8 nodes and thinks its memory is flat. There are some tricks to confine it to node 0 (dom0_mem=<memory of node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires intimate knowledge of the system's parameters and is error-prone. Also this does not work well with ballooning. Actually we could improve the NUMA placement with that: by asking the Dom0 explicitly for memory from a certain node.

> * having internal Xen data structures honour the placement (e.g.,
>   I've been told that right now vcpu stacks are always allocated
>   on node 0... Andrew?).
>
> [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
>       just have them _prefer_ running on the nodes where their memory
>       is.

This would be really cool. I once thought about something like a home node: we start with placement to allocate memory from one node, then we relax the VCPU pinning, but mark this node as special for this guest, so that, if possible, it gets run there. But in times of CPU pressure we are happy to let it run on other nodes: CPU starvation is much worse than the NUMA penalty. (A toy sketch of such a home-node preference follows this message.)

> [D] - Dynamic memory migration between different nodes of the host, as
>       the counterpart of the NUMA-aware scheduler.

I once read about a VMware feature: bandwidth-limited migration in the background, hot pages first. So we get flexibility and avoid CPU starving, but still don't hog the system with memory copying. Sounds quite ambitious, though.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
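To illustrate the home-node idea in the message above, here is a toy sketch of a pcpu-picking step that prefers the guest's home node but happily falls back to any idle pcpu under pressure. It uses plain 64-bit masks instead of Xen's cpumask_t and is not based on the actual credit scheduler code; everything here is made up for the example.

/* Toy sketch of home-node-preferring pcpu selection.  Plain uint64_t masks
 * stand in for Xen's cpumask_t; none of this is real scheduler code. */
#include <stdint.h>
#include <stdio.h>

/* Pick an idle pcpu, preferring those belonging to the vcpu's home node.
 * Returns the pcpu index, or -1 if nothing is idle at all. */
static int pick_pcpu(uint64_t idle_mask, uint64_t home_node_mask)
{
    uint64_t preferred = idle_mask & home_node_mask;
    uint64_t candidates = preferred ? preferred : idle_mask;

    if (!candidates)
        return -1;                      /* no idle pcpu: stay where we are */

    return __builtin_ctzll(candidates); /* lowest set bit = chosen pcpu */
}

int main(void)
{
    uint64_t idle = 0xF0;               /* pcpus 4-7 are idle */
    uint64_t home = 0x0F;               /* home node owns pcpus 0-3 */

    /* Home node is busy, so we run remotely rather than starve. */
    printf("picked pcpu %d\n", pick_pcpu(idle, home));
    return 0;
}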
>>> On 03.08.12 at 11:48, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 11:23 AM, Jan Beulich wrote:
>> ...
>
> Honestly, what would be the problems with migration? NUMA awareness is
> actually a software optimization, so we will not really break something
> if the advertised topology isn't the real one.

Sure, nothing would break, but the purpose of the whole feature is improving performance, and that might get entirely lost (or even worse) after a migration to a different topology host.

Jan
>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
>> ...
>> * automatic placement of Dom0, if possible (my current series is
>>   only affecting DomU)
>
> I think Dom0 NUMA awareness should be one of the top priorities. If I
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
> actually has memory from all 8 nodes and thinks its memory is flat.
> There are some tricks to confine it to node 0 (dom0_mem=<memory of
> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
> intimate knowledge of the system's parameters and is error-prone.

How about "dom0_mem=node<n> dom0_vcpus=node<n>" as an extension to the current options?

> Also this does not work well with ballooning.
> Actually we could improve the NUMA placement with that: by asking the
> Dom0 explicitly for memory from a certain node.

Yes, passing sideband information to the balloon driver was always a missing item, not only for NUMA support, but also for address-restricted memory (e.g. such needed to start 32-bit PV guests on big systems).

Jan
On 03/08/12 10:48, Andre Przywara wrote:
>>> I think we could use cpu hot-plug to change the "virtual topology" of
>>> VMs, couldn't we? We could probably even do that on a running guest
>>> if we really needed to.
>> Hmm, not sure - using hotplug behind the back of the guest might
>> be possible, but you'd first need to hot-unplug the vCPU. That's
>> something that I don't think you can do on HVM guests (and for
>> PV guests, guest-visible NUMA support makes even less sense
>> than for HVM ones).
> I don't think that hotplug would really work. I have checked this some
> time ago; at least the Linux NUMA code cannot really be fooled by this.
> The SRAT table is firmware-defined and static by nature, so there is no
> code in Linux to change the NUMA topology at runtime. This is especially
> true for the memory layout.

I was more thinking of giving a VM the biggest topology you would want at boot, and then asking Linux to online or offline vcpus; for example, giving it a 4x2 topology (4 vcores x 2 vnodes). When running on a system with 2 cores per node, you offline 2 vcpus per vnode, giving it an effective layout of 2x2. When running on a system with 4 cores per node, you could offline all of the cores on one node, giving it an effective topology of 4x1.

Unfortunately, I just realized that you could change the number of vcpus in a given node, but you couldn't move the memory around very easily. Unless you have memory hotplug? Hmm..... :-)

-George
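For reference, the in-guest side of the online/offline scheme described above is just the standard Linux CPU hotplug sysfs interface. The tiny sketch below drives it directly; deciding which vcpus to offline so the effective layout matches the host is up to the toolstack or an in-guest agent and is not shown.

/* Minimal sketch: offline or online a guest vCPU through the standard Linux
 * sysfs CPU hotplug interface (/sys/devices/system/cpu/cpuN/online).
 * Needs root inside the guest; cpu0 usually cannot be offlined. */
#include <stdio.h>
#include <stdlib.h>

static int set_cpu_online(unsigned int cpu, int online)
{
    char path[64];
    snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%u/online", cpu);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", online ? 1 : 0);
    return fclose(f);
}

int main(int argc, char *argv[])
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <cpu> <0|1>\n", argv[0]);
        return 1;
    }
    return set_cpu_online(atoi(argv[1]), atoi(argv[2])) ? 1 : 0;
}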
On 08/03/2012 12:40 PM, Jan Beulich wrote:
>>>> On 03.08.12 at 12:02, Andre Przywara <andre.przywara@amd.com> wrote:
>> ...
>> I think Dom0 NUMA awareness should be one of the top priorities. If I
>> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
>> actually has memory from all 8 nodes and thinks its memory is flat.
>> There are some tricks to confine it to node 0 (dom0_mem=<memory of
>> node0> dom0_vcpus=<cores in node0> dom0_vcpus_pin), but this requires
>> intimate knowledge of the system's parameters and is error-prone.
>
> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> an extension to the current options?

Yes, that sounds like a good idea. And relatively easy to implement. Maybe a list or a number of nodes (to make it more complicated ;-)

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
>>> On 03.08.12 at 13:26, Andre Przywara <andre.przywara@amd.com> wrote:
> On 08/03/2012 12:40 PM, Jan Beulich wrote:
>> ...
>> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> an extension to the current options?
>
> Yes, that sounds like a good idea. And relatively easy to implement.
> Maybe a list or a number of nodes (to make it more complicated ;-)

Oh yes, of course I implied this flexibility. Just wanted to give an easy to read example.

Jan
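Since "dom0_mem=node<n>" is only a proposal at this point, the fragment below is nothing more than a standalone sketch of how such a value could be parsed into a node list. It is not taken from Xen's actual dom0_mem handling, and all names and the exact syntax ("node<n>[,node<m>...]") are invented for illustration.

/* Standalone sketch: parse a hypothetical "node<n>[,node<m>...]" value into
 * a node bitmap.  Not Xen code. */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int parse_node_list(const char *s, uint64_t *nodemask)
{
    *nodemask = 0;
    while (*s) {
        if (strncmp(s, "node", 4) != 0)
            return -1;                          /* not of the form node<n> */
        char *end;
        unsigned long n = strtoul(s + 4, &end, 10);
        if (end == s + 4 || n >= 64)
            return -1;                          /* missing or absurd node id */
        *nodemask |= 1ULL << n;
        s = (*end == ',') ? end + 1 : end;
        if (s == end && *end)                   /* trailing garbage */
            return -1;
    }
    return 0;
}

int main(void)
{
    uint64_t mask;
    if (parse_node_list("node0,node2", &mask) == 0)
        printf("nodemask = 0x%llx\n", (unsigned long long)mask);  /* 0x5 */
    return 0;
}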
On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote:
> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
> >> an extension to the current options?
> >
> > Yes, that sounds like a good idea. And relatively easy to implement.
> > Maybe a list or a number of nodes (to make it more complicated ;-)
>
> Oh yes, of course I implied this flexibility. Just wanted to give
> an easy to read example.
>
Yep, I agree it sounds nice and should not be too hard. I'll update the Wiki page.

I only have one question: should we try to take IONUMA into account here as well? I mean, if it turns out that I/O hubs are connected to some specific node(s), shouldn't we consider pinning/"affining" Dom0 to those node(s), as it most likely will be responsible for some/most DomUs' I/O?

Thanks and Regards,
Dario
>>> On 03.08.12 at 15:14, Dario Faggioli <raistlin@linux.it> wrote:
> On Fri, 2012-08-03 at 12:38 +0100, Jan Beulich wrote:
>> >> How about "dom0_mem=node<n> dom0_vcpus=node<n>" as
>> >> an extension to the current options?
>> >
>> > Yes, that sounds like a good idea, and relatively easy to implement.
>> > Maybe also allow a list or a number of nodes (to make it more complicated ;-).
>>
>> Oh yes, of course I implied that flexibility; I just wanted to give an
>> easy-to-read example.
>>
> Yep, I agree it sounds nice and should not be too hard. I'll update the
> Wiki page.
>
> I only have one question: should we try to take IONUMA into account here
> as well? I mean, if it turns out that the I/O hubs are connected to some
> specific node(s), shouldn't we consider pinning/"affining" Dom0 to those
> node(s), as it will most likely be responsible for some/most DomUs' I/O?

I don't think the necessary information is available at the time when
Dom0 gets constructed.

Jan
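To make the "dom0_mem=node<n>" syntax proposed in the exchange above a bit more concrete, here is a minimal C sketch of the kind of parsing it would need, including the list-of-nodes flexibility that was mentioned. Everything in it (the function name, the 64-node limit, the fall-back convention) is an assumption made up for illustration; it does not reflect Xen's actual command line handling.

    /* Purely illustrative: parse "node0" or "node0,node2" into a node bitmap.
     * This is NOT actual Xen code; names, types and limits are made up. */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    static int parse_dom0_nodes(const char *arg, uint64_t *node_mask)
    {
        char buf[128], *tok;

        *node_mask = 0;
        if (strlen(arg) >= sizeof(buf))
            return -1;
        strcpy(buf, arg);

        for (tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
            unsigned long node;

            if (strncmp(tok, "node", 4) != 0)
                return -1;               /* not the new syntax: fall back to the old one */
            node = strtoul(tok + 4, NULL, 10);
            if (node >= 64)
                return -1;
            *node_mask |= 1ULL << node;  /* dom0 memory/vcpus come from these nodes */
        }
        return *node_mask ? 0 : -1;
    }

    int main(void)
    {
        uint64_t mask;

        if (parse_dom0_nodes("node0,node2", &mask) == 0)
            printf("confine dom0 to nodes 0x%llx\n", (unsigned long long)mask);
        return 0;
    }

The resulting mask could then drive both the memory allocation and the vcpu placement for Dom0 at construction time, subject to the caveat above about which information is actually available that early during boot.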
> From: Dario Faggioli [mailto:raistlin@linux.it]
> Subject: [Xen-devel] NUMA TODO-list for xen-devel
>
> Hi everyone,

Hi Dario --

Thanks for your great work on NUMA... an interest area of mine but one,
sadly, I haven't been able to give much time to, so I'm glad you've taken
this bull by the horns.

I've been sitting on an idea for some time that probably deserves some
exposure on your list. Naturally, it involves my favorite topic, tmem
(readers, please don't tune out yet :-).

It has occurred to me that a fundamental tenet of NUMA is to put
infrequently used data on "other" nodes, while pulling frequently used
data onto a "local" node.

Tmem very nicely separates infrequently-used data from frequently-used
data with an API/ABI that is now fully implemented in upstream Linux.

If Xen had an "alloc_page_on_any_node_but_the_current_one()" (or
"any_node_except_this_guests_node_set" for multi-node guests) and Xen's
tmem implementation were to use it, especially in combination with
selfballooning (also upstream), this could solve a significant part of
the NUMA problem when running tmem-enabled guests. The most frequently
used data stays in the guest (thus in the guest's "current node") and the
less frequently used data lives in tmem in the hypervisor (on the
complement of the guest's node set).

Naturally, this doesn't solve any NUMA problems at all for tmem-ignorant
or tmem-disabled guests, but if it works sufficiently well for
tmem-enabled guests, that might encourage other OS's to do a simple
implementation of tmem.

Sadly, I'm not able to invest much time in this idea, but the combination
of tmem and NUMA might interest some developers and/or grad students, in
which case I'd be happy to spend a little time assisting. I'll be at Xen
Summit for at least the first day, so we can chat more if you are
interested.

George/Jan, I suspect you have the best knowledge of tmem outside of
Oracle, as well as being NUMA-fluent, so I'd appreciate your thoughts as
well!

Thanks,
Dan
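As a rough illustration of the "allocate anywhere but on the guest's nodes" idea described above, here is a small sketch. The per-node page accounting and the node-set representation are invented for the example; none of this is Xen's actual tmem or allocator code.

    /* Hypothetical sketch: place tmem pages on nodes the guest does NOT run
     * on, falling back to the guest's own nodes only when the rest of the
     * host is full. */
    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_NODES 8

    static unsigned int free_pages[MAX_NODES] = { 100, 0, 50, 50, 0, 0, 0, 0 };

    /* Returns the node to take a tmem page from, or -1 if no memory is left. */
    static int tmem_pick_node(const bool guest_nodes[MAX_NODES])
    {
        int node;

        /* First pass: any node outside the guest's node set (infrequently
         * used data can tolerate being remote). */
        for (node = 0; node < MAX_NODES; node++) {
            if (!guest_nodes[node] && free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }

        /* Second pass: better a "local" tmem page than failing the put. */
        for (node = 0; node < MAX_NODES; node++) {
            if (guest_nodes[node] && free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }

        return -1;
    }

    int main(void)
    {
        bool guest_on[MAX_NODES] = { true };    /* guest confined to node 0 */

        printf("tmem page taken from node %d\n", tmem_pick_node(guest_on));
        return 0;
    }

Combined with selfballooning, a policy of this shape would tend to keep the frequently used working set local to the guest while pushing the cold data onto the complement of its node set, which is the effect described in the mail above.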
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Thursday, August 02, 2012 3:43 AM
> To: Dario Faggioli
> Cc: Andre Przywara; Anil Madhavapeddy; George Dunlap; xen-devel; Andrew Cooper; Yang Z Zhang
> Subject: Re: [Xen-devel] NUMA TODO-list for xen-devel
>
> >>> On 01.08.12 at 18:16, Dario Faggioli <raistlin@linux.it> wrote:
> > - Virtual NUMA topology exposure to guests (a.k.a guest-numa). If a
> >   guest ends up on more than one nodes, make sure it knows it's
> >   running on a NUMA platform (smaller than the actual host, but
> >   still NUMA). This interacts with some of the above points:
>
> The question is whether this is really useful beyond the (I would
> suppose) relatively small set of cases where migration isn't
> needed.
>
> >     * consider this during automatic placement for
> >       resuming/migrating domains (if they have a virtual topology,
> >       better not to change it);
> >     * consider this during memory migration (it can change the
> >       actual topology; should we update it on-line or disable memory
> >       migration?)
>
> The question is whether trading functionality for performance
> is an acceptable choice.

If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
on the following:

"Virtualization: You can have flexibility or you can have performance.
Pick one."

A couple of years ago, when NUMA was first being extensively discussed
for Xen, I suggested that this should really be a "top level" flag that a
sysadmin should be able to select: either optimize for performance or
optimize for flexibility. Then Xen and the Xen tools should "do the right
thing" depending on the selection.

I still think this is a good way to surface the tradeoffs of a very
complex problem to the vast majority of users/admins. Clearly they will
want "both", but forcing the choice will provoke more thought about their
use model, as well as provide important guidance to the underlying
implementations.
> >>>>> The question is whether this is really useful beyond the (I would
> >>>>> suppose) relatively small set of cases where migration isn't
> >>>>> needed.
> >>>>>
> >>>> Mmm... Not sure I'm getting what you're saying here, sorry. Are you
> >>>> suggesting that exposing a virtual topology is not a good idea as it
> >>>> poses constraints/prevents live migration?
> >
> > Honestly, what would be the problems with migration? NUMA awareness is
> > actually a software optimization, so we will not really break something
> > if the advertised topology isn't the real one.
>
> Sure, nothing would break, but the purpose of the whole feature
> is improving performance, and that might get entirely lost (or
> even worse) after a migration to a different topology host.

+1

In the end, customers who care about getting 99.9% of native performance
should use physical hardware. Live migration means that someone/something
is trying to do resource optimization, so performance optimization is
secondary. But claiming great performance before migration and getting
sucky performance after migration is, IMHO, a disaster, especially when
future "cloud users" won't have a clue whether their environment has
migrated or not.

Just my two cents...
> > [D] - Dynamic memory migration between different nodes of the host. As
> >       the counter-part of the NUMA-aware scheduler.
>
> I once read about a VMware feature: bandwidth-limited migration in the
> background, hot pages first. So we get flexibility and avoid CPU
> starving, but still don't hog the system with memory copying.
> Sounds quite ambitious, though.

Something like this, but between NUMA nodes instead of physical systems?

http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
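A very rough, self-contained sketch of what one background pass of such a "bandwidth-limited, hot pages first" migration between nodes could look like follows. The page list, the access counters and the per-pass budget are all assumptions made up for the example; real code would also have to allocate a frame on the target node, copy it and fix up the P2M/page tables, which is glossed over here.

    /* Conceptual sketch only: rate-limited, hot-pages-first migration of a
     * domain's pages towards a target NUMA node. All data structures and the
     * access-count sampling are invented for the purpose of the example. */
    #include <stdio.h>
    #include <stdlib.h>

    struct page_info {
        unsigned long mfn;
        unsigned int access_count;   /* e.g. sampled by periodic A-bit scanning */
        int node;
    };

    static int hotter_first(const void *a, const void *b)
    {
        const struct page_info *pa = a, *pb = b;

        /* Sort in descending order of access count. */
        return (pa->access_count < pb->access_count) -
               (pa->access_count > pb->access_count);
    }

    /* Move at most 'budget' pages towards 'target_node' in one pass, so the
     * copying never hogs the memory bus and can be spread over time. */
    static unsigned int migrate_pass(struct page_info *pages, unsigned int nr,
                                     int target_node, unsigned int budget)
    {
        unsigned int i, moved = 0;

        qsort(pages, nr, sizeof(*pages), hotter_first);

        for (i = 0; i < nr && moved < budget; i++) {
            if (pages[i].node == target_node)
                continue;
            /* Real code: allocate on target_node, copy the frame, update the
             * translations, free the old frame. Here we only flip a field. */
            pages[i].node = target_node;
            moved++;
        }
        return moved;
    }

    int main(void)
    {
        struct page_info pages[] = {
            { 0x1000, 50, 1 }, { 0x1001, 5, 1 }, { 0x1002, 90, 1 }, { 0x1003, 2, 0 },
        };

        printf("moved %u page(s) this pass\n", migrate_pass(pages, 4, 0, 2));
        return 0;
    }

The budget is what keeps each pass bandwidth-limited; ordering by hotness is what extracts the biggest locality benefit from every page that does get moved.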
>>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> The question is whether trading functionality for performance
>> is an acceptable choice.
>
> If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
> on the following:
>
> "Virtualization: You can have flexibility or you can have performance.
> Pick one."
>
> A couple of years ago, when NUMA was first being extensively discussed
> for Xen, I suggested that this should really be a "top level" flag that a
> sysadmin should be able to select: either optimize for performance or
> optimize for flexibility. Then Xen and the Xen tools should "do the right
> thing" depending on the selection.
>
> I still think this is a good way to surface the tradeoffs of a very
> complex problem to the vast majority of users/admins. Clearly they will
> want "both", but forcing the choice will provoke more thought about their
> use model, as well as provide important guidance to the underlying
> implementations.

I would expect a good part of them to pick performance, and then go whine
about something not working in an emergency. On xen-devel one could
respond with this-is-what-you-get, but you can't necessarily do so to
paying customers...

Jan
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Subject: RE: [Xen-devel] NUMA TODO-list for xen-devel
>
> >>> On 04.08.12 at 00:34, Dan Magenheimer <dan.magenheimer@oracle.com> wrote:
> >> From: Jan Beulich [mailto:JBeulich@suse.com]
> >> The question is whether trading functionality for performance
> >> is an acceptable choice.
> >
> > If there were a lwn.net equivalent for Xen, I'd be pushing to get quoted
> > on the following:
> >
> > "Virtualization: You can have flexibility or you can have performance.
> > Pick one."
> >
> > A couple of years ago, when NUMA was first being extensively discussed
> > for Xen, I suggested that this should really be a "top level" flag that a
> > sysadmin should be able to select: either optimize for performance or
> > optimize for flexibility. Then Xen and the Xen tools should "do the right
> > thing" depending on the selection.
> >
> > I still think this is a good way to surface the tradeoffs of a very
> > complex problem to the vast majority of users/admins. Clearly they will
> > want "both", but forcing the choice will provoke more thought about their
> > use model, as well as provide important guidance to the underlying
> > implementations.
>
> I would expect a good part of them to pick performance, and then go whine
> about something not working in an emergency. On xen-devel one could
> respond with this-is-what-you-get, but you can't necessarily do so to
> paying customers...

Well, you can, but you first have to convince marketing that
virtualization doesn't solve all problems for all users all the time. :-)

The two options would have to be clearly documented as
"flexibility-is-my-highest-priority-and-performance-is-second-priority"
and
"performance-is-my-highest-priority-and-flexibility-is-second-priority",
and when a user selects the latter, they should be prompted with "Are you
really sure you want to use virtualization instead of bare metal?"

Sigh. We can only wish.

Dan
On Thu, 2012-08-02 at 01:04 +0000, Zhang, Yang Z wrote:
> > - Automatic placement at guest creation time. Basics are there and
> >   will be shipping with 4.2. However, a lot of other things are
> >   missing and/or can be improved, for instance:
> >   [...]
> >     * consider IONUMA during placement;
>
> We should consider two things:
> 1. Dom0 IONUMA: devices used by Dom0 should get their DMA buffers from
>    the node where the device resides. Currently, Dom0 allocates DMA
>    buffers without providing the node info to the hypercall.
> 2. Guest IONUMA: when a guest boots up with a pass-through device, we
>    need to allocate its memory from the node where the device resides,
>    for further DMA buffer allocation, and let the guest know the IONUMA
>    topology. This relies on guest NUMA.
> This topic was mentioned at Xen Summit 2011:
> http://xen.org/files/xensummit_seoul11/nov2/5_XSAsia11_KTian_IO_Scalability_in_Xen.pdf
>
Seems fine, I knew that presentation and I added these details to the
Wiki page (sorry for the delay). Are you (or someone from your group)
perhaps working, or planning to work, on this?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
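To illustrate what "providing the node info to the hypercall" could mean in practice, here is a tiny sketch. The request structure, the device-to-node lookup and the bus-number threshold are all made up for the example; they are not Xen's actual memory interface.

    /* Illustration only: attach a node hint, derived from the device's
     * location, to a DMA buffer request. Structures and helpers are invented
     * for the sketch. */
    #include <stdio.h>

    struct mem_request {
        unsigned long nr_pages;
        int node;                  /* -1 would mean "no preference" */
    };

    /* Pretend lookup: which NUMA node a device on a given PCI bus hangs off. */
    static int device_to_node(unsigned int bus)
    {
        return bus < 0x40 ? 0 : 1;     /* assumption just for this example */
    }

    static void alloc_dma_buffer(unsigned int bus, unsigned long nr_pages)
    {
        struct mem_request req = {
            .nr_pages = nr_pages,
            .node     = device_to_node(bus),   /* the info that is missing today */
        };

        /* A real implementation would hand 'req' to the hypervisor, which
         * would then try to satisfy the allocation from that node. */
        printf("request %lu page(s) on node %d for a device on bus %#x\n",
               req.nr_pages, req.node, bus);
    }

    int main(void)
    {
        alloc_dma_buffer(0x05, 16);
        alloc_dma_buffer(0x80, 16);
        return 0;
    }

The guest IONUMA case is the same idea one level up: the guest needs to be told which (virtual) node the pass-through device sits on, so that it can make the equivalent decision internally, which is why it depends on guest NUMA.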
On Fri, 2012-08-03 at 15:22 -0700, Dan Magenheimer wrote:
> Hi Dario --
>
Hello Dan,

> Thanks for your great work on NUMA... an interest area of mine but one,
> sadly, I haven't been able to give much time to, so I'm glad you've taken
> this bull by the horns.
>
Trying to... Let's see! :-P

> I've been sitting on an idea for some time that probably deserves some
> exposure on your list. Naturally, it involves my favorite topic, tmem
> (readers, please don't tune out yet :-).
>
It sure does! I've already put something quite generic about "memory
sharing" there, because I know that it has all but trivial interactions
with the improved NUMA support I am/we are trying to envision. The fact
that it is, as I said, generic, is due to my ignorance (let's say, for
now) of the whole tmem thing, so thanks for the contribution: it's very
useful to hear your point of view on this!

> It has occurred to me that a fundamental tenet of NUMA is to put
> infrequently used data on "other" nodes, while pulling frequently used
> data onto a "local" node.
>
> Tmem very nicely separates infrequently-used data from frequently-used
> data with an API/ABI that is now fully implemented in upstream Linux.
>
I see, and it seems nice.

> [..]
>
> Naturally, this doesn't solve any NUMA problems at all for tmem-ignorant
> or tmem-disabled guests, but if it works sufficiently well for
> tmem-enabled guests, that might encourage other OS's to do a simple
> implementation of tmem.
>
Sure. In my opinion, this is not an area where we could aim at "solving
every problem for everyone". However, we should definitely target having
a sensible solution for the default and/or most common use cases and
scenarios.

> Sadly, I'm not able to invest much time in this idea, but the combination
> of tmem and NUMA might interest some developers and/or grad students, in
> which case I'd be happy to spend a little time assisting.
>
That's definitely the case. I've tried to put a summary of what you said
in this mail on the Wiki (http://wiki.xen.org/wiki/Xen_NUMA_Roadmap) and
also put your contact next to it. Feel free to update/correct it if you
find something wrong. :-P

> I'll be at Xen Summit for at least the first day, so we can chat more if
> you are interested.
>
I indeed am interested, so let's make that happen! :-)

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
On Thu, 2012-08-02 at 01:04 +0100, Malte Schwarzkopf wrote:
> > Wow... That's really cool. I'll definitely take a deep look at all these
> > data! I'm also adding the link to the wiki, if you're fine with that...
>
> No problem with adding a link, as this is public data :) If possible,
> it'd be splendid to put a note next to this link encouraging people to
> submit their own results -- doing so is very simple, and helps us extend
> the database. Instructions are at
> http://www.cl.cam.ac.uk/research/srg/netos/ipc-bench/ (or, for a short
> link, http://fable.io).
>
Ok, I've tried doing this; here is how it looks:

http://wiki.xen.org/wiki/Xen_NUMA_Roadmap
http://wiki.xen.org/wiki/Xen_NUMA_Roadmap#Inter-VM_dependencies_and_communication_issues

Thanks also for the references, I'll definitely take a look at them. :-)

> One interesting thing to look at (that we haven't looked at yet) is what
> memory allocators do about NUMA these days; there is an AMD whitepaper
> from 2009 discussing the performance benefits of a NUMA-aware version of
> tcmalloc [3], but I have found it hard to reproduce their results on
> modern hardware. Of course, being virtualized may complicate matters
> here, since the memory allocator can no longer freely pick and choose
> where to allocate from.
>
> Scheduling, notably, is key here, since the CPU a process is scheduled
> on may determine where its memory is allocated -- frequent migrations
> are likely to be bad for performance due to remote memory accesses,
>
That might be true for Linux, but it's not so much true (fortunately :-P)
for Xen. However, I also think scheduling is a very important aspect of
this whole NUMA thing... I'll repost my NUMA-aware credit scheduler
patches soon.

> although we have been unable to quantify a significant difference on
> non-synthetic macrobenchmarks; that said, we did not try very hard so far.
>
I think both kinds of benchmark are interesting. I tried to concentrate a
bit on macrobenchmarks (specjbb -- I'll let you decide whether that's
synthetic or not :-D).

Another issue, if we want to tackle the problem of
communicating/cooperating VMs, pops up at the interface level, i.e., how
do we want the user to tell us that 2 (or more) VMs are "related"? Up to
what level of detail? Should this "relationship" be permanent, or might
it change over time?

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
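On the interface question raised above, one possible (entirely hypothetical) shape for it, just to make the discussion concrete: the user tags related VMs with a shared group name, and the placement code biases its node scoring towards, or away from, nodes that already host members of the same group, depending on the chosen policy. Nothing below reflects the existing libxl placement code, and the implied config key is an assumption.

    /* Hypothetical sketch of group-aware node scoring for placement. */
    #include <stdio.h>
    #include <string.h>

    #define MAX_NODES 4

    struct placed_vm {
        const char *name;
        const char *group;    /* e.g. "webapp1"; NULL if not related to anything */
        int node;             /* node the VM ended up on */
    };

    /* Lower score = better candidate. 'colocate' selects between the two
     * policies: pack the group onto shared nodes, or spread it out. */
    static int node_score(int node, const struct placed_vm *vms, int nr_vms,
                          const char *group, int base_load, int colocate)
    {
        int i, mates = 0;

        for (i = 0; i < nr_vms; i++)
            if (group && vms[i].group && vms[i].node == node &&
                strcmp(vms[i].group, group) == 0)
                mates++;

        return colocate ? base_load - mates : base_load + mates;
    }

    int main(void)
    {
        struct placed_vm placed[] = {
            { "web1", "webapp1", 0 }, { "db1", "webapp1", 2 },
        };
        int loads[MAX_NODES] = { 3, 1, 2, 1 };
        int node;

        for (node = 0; node < MAX_NODES; node++)
            printf("node %d: score %d\n", node,
                   node_score(node, placed, 2, "webapp1", loads[node], 1));
        return 0;
    }

Whether the relationship should be permanent or allowed to change over time then becomes a question of whether the tag can be updated on a running domain, and whether such an update should trigger any re-placement or memory migration.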
On Fri, 2012-08-03 at 15:42 -0700, Dan Magenheimer wrote:
> > > [D] - Dynamic memory migration between different nodes of the host. As
> > >       the counter-part of the NUMA-aware scheduler.
> >
> > I once read about a VMware feature: bandwidth-limited migration in the
> > background, hot pages first. So we get flexibility and avoid CPU
> > starving, but still don't hog the system with memory copying.
> > Sounds quite ambitious, though.
>
> Something like this, but between NUMA nodes instead of physical systems?
>
> http://osnet.cs.binghamton.edu/publications/hines09postcopy_osr.pdf
>
Likely. The analogy between this kind of "memory migration" and the
actual live migration we already have is indeed something I want to take
advantage of. The fact that we support that small thing called
_paravirtualization_ is complicating it all quite a bit, but I'm looking
into it...

Thanks for the reference. :-)

Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
On Fri, 2012-08-03 at 12:02 +0200, Andre Przywara wrote:
> On 08/01/2012 06:16 PM, Dario Faggioli wrote:
> > ...
> >
> >     * automatic placement of Dom0, if possible (my current series is
> >       only affecting DomU)
>
> I think Dom0 NUMA awareness should be one of the top priorities. If I
> boot my 8-node box with Xen, I end up with a NUMA-clueless Dom0 which
> actually has memory from all 8 nodes and thinks its memory is flat.
>
Ok, I updated the Wiki page with a link to this (sub)thread --- more
specifically, to the mails where we agree about the new syntax. I can
work on it, but not in the next few days, so let's see if anyone steps up
before I get to look at it. :-)

> > [D] - NUMA aware scheduling in Xen. Don't pin vcpus on nodes' pcpus,
> >       just have them _prefer_ running on the nodes where their memory
> >       is.
>
> This would be really cool. I once thought about something like a
> home node: we start with placement, to allocate memory from one node;
> then we relax the VCPU pinning, but mark that node as special for this
> guest, so that it, if possible, gets run there. But in times of
> CPU pressure we are happy to let it run on other nodes: CPU starving is
> much worse than the NUMA penalty.
>
Yep. Patches coming soon.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://retis.sssup.it/people/faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
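The "home node" behaviour described above can be summed up in a tiny scoring sketch. It is only an illustration of the policy, not the credit scheduler code, and the cost weights are arbitrary assumptions: queue length dominates, being off the home node is only a small penalty.

    /* Illustrative only: pick a pcpu preferring the vcpu's home node, but
     * never at the price of leaving the vcpu waiting while remote pcpus idle. */
    #include <stdio.h>

    #define NR_CPUS 8

    struct pcpu {
        int node;
        int nr_runnable;   /* vcpus already queued on this pcpu */
    };

    static int pick_cpu(const struct pcpu cpus[NR_CPUS], int home_node)
    {
        int cpu, best = -1, best_cost = 1 << 30;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            /* Queue length dominates; being off the home node only adds a
             * small penalty, so an idle remote pcpu still beats a busy
             * local one: CPU starving is worse than the NUMA penalty. */
            int cost = cpus[cpu].nr_runnable * 4 +
                       (cpus[cpu].node == home_node ? 0 : 1);

            if (cost < best_cost) {
                best_cost = cost;
                best = cpu;
            }
        }
        return best;
    }

    int main(void)
    {
        struct pcpu cpus[NR_CPUS] = {
            { 0, 2 }, { 0, 2 }, { 0, 3 }, { 0, 2 },   /* home node 0: all busy  */
            { 1, 0 }, { 1, 1 }, { 1, 2 }, { 1, 0 },   /* node 1: has idle pcpus */
        };

        printf("picked pcpu %d\n", pick_cpu(cpus, 0));
        return 0;
    }

With equal load the home node wins because of the small penalty term, but as soon as the home node's runqueues build up, a remote pcpu takes over, which is exactly the "prefer, don't pin" behaviour being discussed.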