Hello,

Does the latest version of Xen (3.4.1) support NUMA machines? Is there a PDF or a link that can give me some more details about that? I work on a project on Xen performance on NUMA machines, and in Xen 3.3.0 this performance isn't good. Has something changed in the latest version?

Thanks in advance,
Papagiannis Anastasios

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Add the Xen boot parameter 'numa=on' to enable NUMA detection. Then it's up to you to, for example, pin domains to specific nodes, using the 'cpus=...' option in the domain config file. See /etc/xen/xmexample1 for an example of its usage.

 -- Keir

On 04/11/2009 12:02, "Papagiannis Anastasios" <apapag@ics.forth.gr> wrote:
> does the latest version of Xen (3.4.1) support NUMA machines? Is there a .pdf
> or a link that can give me some more details about that? [...]
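For reference, a minimal sketch of what Keir describes might look like the following in a domain config file (xm config files use Python syntax; the guest name, kernel path, and memory values here are invented for illustration, following the xmexample1 convention):

```python
# Hypothetical guest config: restrict all vCPUs of this domain to the
# pCPUs of one NUMA node (here CPUs 0-3, assumed to be node 0), so the
# scheduler keeps the guest next to its node-local memory.
kernel = "/boot/vmlinuz-2.6.18-xen"
memory = 1024
name = "numa-guest"
vcpus = 2
cpus = "0-3"   # pin to node 0's pCPUs; adjust to your topology
```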
VMware has the notion of a "cell", where VMs can be scheduled only within a cell, not across cells. Cell boundaries are determined by VMware by default, though certain settings can override them.

An interesting project might be to implement "numa=cell" for Xen... or maybe something similar is already in George Dunlap's scheduler plans?

> -----Original Message-----
> From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
> Sent: Wednesday, November 04, 2009 5:33 AM
> To: Papagiannis Anastasios; xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] Xen 3.4.1 NUMA support
>
> Add Xen boot parameter 'numa=on' to enable NUMA detection.
> Then it's up to you to, for example, pin domains to specific nodes,
> using the 'cpus=...' option in the domain config file. [...]
I haven't had time to look at NUMA stuff at all. I probably will look at it eventually, if no one else does, but I'd be happy if someone else could pursue it.

 -George

Dan Magenheimer wrote:
> VMware has the notion of a "cell", where VMs can be
> scheduled only within a cell, not across cells.
>
> An interesting project might be to implement
> "numa=cell" for Xen.... or maybe something similar
> is already in George Dunlap's scheduler plans? [...]
George,

What's the current scope and status of your scheduler work? Is it going to look similar to the Linux scheduler (with scheduling domains, et al.)? In that case, topology is already accounted for, to a large extent. It would be good to know so that I can work on something that doesn't overlap.

-dulloor

On Mon, Nov 9, 2009 at 6:33 AM, George Dunlap <george.dunlap@eu.citrix.com> wrote:
> I haven't had time to look at NUMA stuff at all. I probably will look at it
> eventually, if no one else does, but I'd be happy if someone else could
> pursue it. [...]
Cpupools? :-)

NUMA was a topic I wanted to look at as soon as cpupools are officially accepted. Keir wanted to propose a way to get rid of the function continue_hypercall_on_cpu(), which was causing most of the issues leading to the objection to cpupools. I guess Keir had some higher-priority jobs. :-)

So I will try a new patch for cpupools without continue_hypercall_on_cpu() and perhaps with NUMA support.

George, would this be okay for you? I think your scheduler will still have problems with domain weights as long as domains are restricted to some processors, right?


Juergen

George Dunlap wrote:
> I haven't had time to look at NUMA stuff at all. I probably will look
> at it eventually, if no one else does, but I'd be happy if someone else
> could pursue it. [...]

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 636 47950
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Otto-Hahn-Ring 6                        Internet: ts.fujitsu.com
D-81739 Muenchen                 Company details: ts.fujitsu.com/imprint.html
On Mon, Nov 9, 2009 at 11:44 AM, Juergen Gross <juergen.gross@ts.fujitsu.com> wrote:
> George, would this be okay for you? I think your scheduler will still have
> problems with domain weights as long as domains are restricted to some
> processors, right?

Hmm, this may be a point of discussion at some point. My plan was actually to have one runqueue per L2 processor cache. Thus as many as 4 cores (and possibly 8 hyperthreads) would be sharing the same runqueue; doing CPU pinning within the same runqueue would be problematic.

I was planning on having credits work mainly within one runqueue, and then do load balancing between runqueues. In that case, pinning to a specific runqueue shouldn't cause a problem, because credits of one runqueue wouldn't affect credits of another one.

However, I haven't implemented or tested this idea yet; it's possible that keeping credits distinct and doing load balancing between runqueues will cause unacceptable levels of unfairness. I expect it to be fine (especially since Linux's scheduler does this kind of load balancing, but doesn't share runqueues between logical processors), but without implementation and testing I can't say for sure.

Thoughts are welcome at this point, but it will probably be better to have a real discussion once I've posted some patches.

 -George
On Mon, Nov 9, 2009 at 11:39 AM, Dulloor <dulloor@gmail.com> wrote:
> What's the current scope and status of your scheduler work? Is it
> going to look similar to the Linux scheduler (with scheduling domains,
> et al.)? In that case, topology is already accounted for, to a large
> extent. It would be good to know so that I can work on something that
> doesn't overlap.

My plan was to do something similar to Linux, but with this difference: instead of having one runqueue per logical processor (as both Xen and Linux currently do), and having "domains" all the way up (as Linux currently does), I had planned on having one runqueue per L2 processor cache. The main reason to avoid migration is to preserve a warm cache; but since L1s are replaced so quickly, there should be little impact from a VM migrating between different threads and cores which share the same L2.

Above the L2s I was planning on having an idea similar to the Linux "domains" (although obviously it would need a different name to avoid confusion), and doing explicit load-balancing between them. But as I have not had a chance to test this kind of load balancing yet, the plan may change somewhat before then.

Problems to solve wrt NUMA, as I understand it, are to balance the performance cost of sharing a busy local CPU against the performance cost of non-local memory accesses. This would involve adding the NUMA logic to the load-balancing algorithm. Which I guess would depend in part on having a load-balancing algorithm to begin with. :-)

Once I have the basic credit patches in working order, would you be interested in working on the load-balancing between runqueues? I can then work on further testing of the credit algorithm. My ultimate goal would be to have a basic regression test that people could use to measure how their changes to the scheduler affect a wide variety of workloads.

 -George
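The scheme George describes (one shared runqueue per L2 cache, with explicit balancing between runqueues) can be sketched roughly as follows. This is a hypothetical illustration, not Xen code; the topology mapping and load metric are invented for the example:

```python
# Sketch: group logical CPUs into one runqueue per shared L2 cache,
# then pick a (busiest, idlest) pair of runqueues to balance between.
from collections import defaultdict

def build_runqueues(cpu_to_l2):
    """Map {cpu: l2_cache_id} to {l2_cache_id: [cpus sharing that L2]}."""
    runqueues = defaultdict(list)
    for cpu, l2 in cpu_to_l2.items():
        runqueues[l2].append(cpu)
    return dict(runqueues)

def balance_pair(load_per_runqueue):
    """Choose the pair to balance: migrate work from busiest to idlest."""
    busiest = max(load_per_runqueue, key=load_per_runqueue.get)
    idlest = min(load_per_runqueue, key=load_per_runqueue.get)
    return busiest, idlest

# Four cores, two cores per L2: cores 0-1 share L2 #0, cores 2-3 share L2 #1.
rq = build_runqueues({0: 0, 1: 0, 2: 1, 3: 1})   # {0: [0, 1], 1: [2, 3]}
pair = balance_pair({0: 7, 1: 2})                # (0, 1)
```

Within a runqueue, migration between its CPUs is free (the L2 stays warm); credits would be accounted per runqueue, and only the inter-runqueue step needs an explicit policy.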
On 09/11/2009 11:44, "Juergen Gross" <juergen.gross@ts.fujitsu.com> wrote:
> NUMA was a topic I wanted to look at as soon as cpupools are officially
> accepted. Keir wanted to propose a way to get rid of the function
> continue_hypercall_on_cpu() which was causing most of the stuff leading
> to the objection of cpupools.
> I guess Keir had some higher priority jobs. :-)

Well, I forgot about it. I think the plan was to perhaps keep something like continue_hypercall_on_cpu(), but not need to actually run the vcpu itself 'over there'; instead, schedule a tasklet or somesuch, and sleep on its completion. That would get rid of the skanky affinity hacks you had to do to support continue_hypercall_on_cpu(). I'll have a look back at what we discussed.

 -- Keir
Sure! Let me know when you have the patches ready. Also, that might be a good time to see if runq-per-L2 works better.

-dulloor

On Mon, Nov 9, 2009 at 7:29 AM, George Dunlap <George.Dunlap@eu.citrix.com> wrote:
> My plan was to do something similar to Linux, but with this
> difference: Instead of having one runqueue per logical processor (as
> both Xen and Linux currently do), and having "domains" all the way up
> (as Linux currently does), I had planned on having one runqueue per L2
> processor cache. [...]
>
> Once I have the basic credit patches in working order, would you be
> interested in working on the load-balancing between runqueues? I can
> then work on further testing of the credit algorithm. [...]
Dan Magenheimer wrote:
> VMware has the notion of a "cell" where VMs can be
> scheduled only within a cell, not across cells.
> Cell boundaries are determined by VMware by
> default, though certain settings can override them.

Well, if I got this right, then you are describing the current behaviour of Xen. It has had a similar feature for some time now (since 3.3, I guess). When you launch a domain on a numa=on machine, it will pick the least busy node (which can hold the requested memory) and restrict the domain to that node (by only allowing CPUs of that node). This is in XendDomainInfo.py (c/s 17131, 17247, 17709).

Looks like this (kernel xen.gz numa=on dom0_mem=6144M dom0_max_vcpus=6 dom0_vcpus_pin):

# xm create opensuse.hvm
# xm create opensuse2.hvm
# xm vcpu-list
Name       ID  VCPU  CPU  State  Time(s)  CPU Affinity
001-LTP     1     0    6  -b-       17.8  6-11
001-LTP     1     1    7  -b-        6.3  6-11
002-LTP     2     0   12  -b-       19.0  12-17
002-LTP     2     1   16  -b-        1.6  12-17
002-LTP     2     2   17  -b-        1.7  12-17
002-LTP     2     3   14  -b-        1.6  12-17
002-LTP     2     4   16  -b-        1.6  12-17
002-LTP     2     5   15  -b-        1.5  12-17
002-LTP     2     6   12  -b-        1.3  12-17
002-LTP     2     7   13  -b-        1.8  12-17
Domain-0    0     0    0  -b-       12.6  0
Domain-0    0     1    1  -b-        7.6  1
Domain-0    0     2    2  -b-        8.0  2
Domain-0    0     3    3  -b-       14.6  3
Domain-0    0     4    4  r--        1.4  4
Domain-0    0     5    5  -b-        0.9  5

# xm debug-keys U
(XEN) Domain 0 (total: 2097152):
(XEN)     Node 0: 2097152
(XEN)     Node 1: 0
(XEN)     Node 2: 0
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0
(XEN) Domain 1 (total: 394219):
(XEN)     Node 0: 0
(XEN)     Node 1: 394219
(XEN)     Node 2: 0
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0
(XEN) Domain 2 (total: 394219):
(XEN)     Node 0: 0
(XEN)     Node 1: 0
(XEN)     Node 2: 394219
(XEN)     Node 3: 0
(XEN)     Node 4: 0
(XEN)     Node 5: 0
(XEN)     Node 6: 0
(XEN)     Node 7: 0

Note that there were no cpus= lines in the config files; Xen did that automatically. Domains can be localhost-migrated to another node:

# xm migrate --node=4 1 localhost

The only issue is with domains larger than a node. If someone has a useful use-case, I can start rebasing my old patches for NUMA-aware HVM domains to Xen unstable.

Regards,
Andre.

BTW: Shouldn't we finally set numa=on as the default value?

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712
Andre Przywara wrote:
> BTW: Shouldn't we finally set numa=on as the default value?

Is there any data to support the idea that this helps significantly on common systems?

 -George
>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
> BTW: Shouldn't we finally set numa=on as the default value?

I'd say no, at least until the default confinement of a guest to a single node gets fixed to properly deal with guests having more vCPUs than a node's worth of pCPUs (i.e. I take it for granted that the benefits of not overcommitting CPUs outweigh the drawbacks of cross-node memory accesses, at the very least for CPU-bound workloads).

Jan
George Dunlap wrote:
> Andre Przywara wrote:
>> BTW: Shouldn't we finally set numa=on as the default value?
>
> Is there any data to support the idea that this helps significantly on
> common systems?

I don't have any numbers handy, but I will see if I can generate some.

Looking from a high-level perspective, it is a shame that it's not the default: with numa=off the Xen domain loader will allocate physical memory from some node (maybe even from several nodes) and will schedule the guest on some other (even rapidly changing) nodes. According to Murphy's law you will end up with _all_ of a guest's memory accesses being remote. But in fact a NUMA architecture is really beneficial for virtualization: as there are close to zero cross-domain memory accesses (except for Dom0), each node is more or less self-contained and each guest can use the node's memory controller almost exclusively.

But this is all spoiled as most people don't know about Xen's NUMA capabilities and don't set numa=on. Making this the default would solve that.

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 488-3567-12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Jochen Polster; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
> >>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
> > BTW: Shouldn't we finally set numa=on as the default value?
>
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPUs than
> a node's worth of pCPUs [...]

What default confinement? I thought guests had an all-pCPUs affinity mask by default?

I suspect we would get benefits enabling NUMA even if all the guests have all-pCPUs affinity masks: all guests will have memory striped across all nodes, which is likely better than allocating from one node and then the other. Obviously, assigning VMs to node(s) and allocating memory accordingly is the best plan.

Ian
I am not finding this. Can you please point to the code?

numa=on/off is only for setting up NUMA in Xen (similar to the Linux knob, but turned off by default). The allocation of memory from a single node (that you observe) could be because of the way alloc_heap_pages is implemented (trying to allocate from all the heaps of a node before trying the next one); try looking at dump_numa output. And affinities are not set anywhere based on the node from which allocation happens.

-dulloor

On Mon, Nov 9, 2009 at 5:51 PM, Andre Przywara <andre.przywara@amd.com> wrote:
> Looking from a high level perspective it is a shame that it's not the
> default: With numa=off the Xen domain loader will allocate physical memory
> from some node (maybe even from several nodes) and will schedule the guest
> on some other (even rapidly changing) nodes. [...]
> But this is all spoiled as most people don't know about Xen's NUMA
> capabilities and don't set numa=on. Using this as a default would solve
> this. [...]
Dulloor wrote:
> I am not finding this. Can you please point to the code?

tools/python/xen/xend/XendDomainInfo.py (around line 2600), with the core code being:

-------------
index = nodeload.index( min(nodeload) )
cpumask = info['node_to_cpu'][index]
for v in range(0, self.info['VCPUs_max']):
    xc.vcpu_setaffinity(self.domid, v, cpumask)
-------------

The code got introduced with c/s 17131 and later got refined with c/s 17247 and c/s 17709.

> numa=on/off is only for setting up numa in xen (similar to the linux
> knob, but turned off by default). The allocation of memory from a
> single node (that you observe) could be because of the way
> alloc_heap_pages is implemented (trying to allocate from all the heaps
> from a node, before trying the next one)

Yes, but if the domain is pinned before it allocates its memory, then the natural behaviour of Xen is to take memory from this local node.

> try looking at dump_numa
> output. And, affinities are not set anywhere based on the node from
> which allocation happens.

It is the other way round: first the domain is pinned, later the memory is allocated (based on the node to which the currently scheduled CPU belongs).

Regards,
Andre.
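A standalone sketch of the selection logic quoted from XendDomainInfo.py might look like the following. The function name, the memory-fit check, and the inputs are hypothetical additions for illustration; the real xend code selects purely by the nodeload minimum:

```python
# Hypothetical standalone version of xend's node selection: pick the
# least-loaded node that can hold the requested memory, and return the
# pCPU mask all of the domain's vCPUs would be pinned to.
def pick_node(nodeload, node_free_mem, node_to_cpu, needed_mem):
    candidates = [n for n in range(len(nodeload))
                  if node_free_mem[n] >= needed_mem]
    if not candidates:
        return None, []   # no single node fits; fall back to no pinning
    best = min(candidates, key=lambda n: nodeload[n])
    return best, node_to_cpu[best]

# Two-node example: node 1 is less loaded and has enough free memory,
# so the domain would be confined to CPUs 4-7.
node, cpumask = pick_node(nodeload=[10, 3],
                          node_free_mem=[4096, 8192],
                          node_to_cpu=[[0, 1, 2, 3], [4, 5, 6, 7]],
                          needed_mem=2048)
```

Because this pinning happens before the domain's memory is allocated, the allocator then naturally takes node-local memory, which is the ordering Andre describes.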
>>> Ian Pratt <Ian.Pratt@eu.citrix.com> 10.11.09 02:46 >>>
> What default confinement? I thought guests had an all-pCPUs affinity
> mask by default?

Not with numa=on (see also Andre's post to this effect): the guest will get assigned to a node, and its affinity set to that node's CPUs.

Jan
On 10/11/2009 08:51, "Jan Beulich" <JBeulich@novell.com> wrote:
>> What default confinement? I thought guests had an all-pCPUs affinity
>> mask by default?
>
> Not with numa=on (see also Andre's post to this effect): The guest will
> get assigned to a node, and its affinity set to that node's CPUs.

...And if it didn't, striping would not happen. In fact, iirc, the default NUMA allocation policy for an all-pCPUs domain is in some respects pessimal: vcpu0's initial node gets drained of memory first. I.e., you get *less* 'striping' than you could with numa=off, where you might at least get lucky.

 -- Keir
On 09/11/2009 15:19, "Jan Beulich" <JBeulich@novell.com> wrote:
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPU-s than
> a node's worth of pCPU-s (i.e. I take it for granted that the benefits of
> not overcommitting CPUs outweigh the drawbacks of cross-node memory
> accesses at the very least for CPU-bound workloads).

If this were fixed (e.g., turn off node locality entirely by default for domains which will not fit into a single node), then I think we could consider numa=on by default.

 -- Keir
George Dunlap wrote:

> Andre Przywara wrote:
>> BTW: Shouldn't we finally set numa=on as the default value?
>
> Is there any data to support the idea that this helps significantly on
> common systems?

I did some tests on an 8-node machine. I will retry this later on 4-node and 2-node systems, but I expect similar numbers. I used multiple guests in parallel, each running bw_mem from lmbench, which is admittedly quite NUMA-sensitive. I cannot publish real numbers (yet?), but the results were dramatic: with numa=on I got the same result for each guest (equal to the native result) as long as the number of guests was less than or equal to the number of nodes (since each guest got its own memory controller). If I disabled NUMA-aware placement, either by explicitly specifying cpus="0-31" in the config file or by booting with numa=off, the values dropped by a factor of 3-5 (!) even for a few guests, with some variance due to the random nature of the core-to-memory mapping.
Overcommitting the nodes (letting multiple guests use each node) lowered the values to about 80% for two guests and 60% for three guests per node, but they never got anywhere close to the numa=off values.
So these results encourage me again to opt for numa=on as the default value.
Keir, I will check whether dropping the node containment in the CPU overcommitment case is an option, but what would be the right strategy in that case?
Warn the user?
Don't contain at all?
Contain to more than one node?

Regards,
Andre.

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
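For reference, the two setups compared in that test boil down to one hypervisor boot parameter (numa=on vs. numa=off) and one line of domain configuration. A minimal sketch of the config side follows; the guest name and memory size are made up for illustration, and the CPU range matches the 32-core, 8-node machine described above:

```python
# Fragment of an xm domain config file (xm configs are Python; see
# /etc/xen/xmexample1 for a full example).  With numa=on on the Xen
# command line and no 'cpus' line, xend places the guest on a single
# node automatically.
name   = "guest1"       # illustrative
memory = 1024           # illustrative
vcpus  = 4
# Pinning the guest across all 32 pCPUs defeats the NUMA-aware
# placement, reproducing the slow (numa=off-like) case measured above:
cpus   = "0-31"
```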
> Overcommitting the nodes (letting multiple guests use each node) lowered
> the values to about 80% for two guests and 60% for three guests per
> node, but they never got anywhere close to the numa=off values.
> So these results encourage me again to opt for numa=on as the default
> value.
> Keir, I will check whether dropping the node containment in the CPU
> overcommitment case is an option, but what would be the right strategy
> in that case?
> Warn the user?
> Don't contain at all?
> Contain to more than one node?

In the case where a VM is asking for more vCPUs than there are pCPUs in a node, we should contain the guest to multiple nodes. (I presume we favour nodes according to the number of vCPUs they already have committed to them?)

We should turn off automatic node containment of any kind if the total number of pCPUs in the system is <= 8 -- on such systems the statistical multiplexing gain of having access to more pCPUs likely outweighs the NUMA placement benefit, and memory striping will be a better strategy. I'm inclined to believe that may be true for 2-node systems with <= 16 pCPUs too, under many workloads.

I'd really like to see us enumerate pCPUs in a sensible order so that it's easier to see the topology. It should be nodes.sockets.cores{.threads}, leaving gaps for missing execution units due to hot plug or non-power-of-two packing. Right now we're inconsistent in the enumeration order, depending on how the BIOS has set things up. It would be great if someone could volunteer to fix this...

Ian
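The placement rules proposed above can be sketched roughly as follows. This is hypothetical code, not the Xen toolstack's actual placement logic; it just encodes the stated heuristics: spill to multiple nodes when a guest's vCPUs exceed one node's pCPUs, favour the least-committed nodes, and skip containment entirely on small systems where striping wins.

```python
def pick_nodes(vcpus, pcpus_per_node, nodes, committed):
    """Return the list of node IDs to confine a new guest to.

    vcpus          -- number of vCPUs the guest asks for
    pcpus_per_node -- pCPUs per node (assumed uniform for simplicity)
    nodes          -- list of node IDs
    committed      -- dict: node ID -> vCPUs already committed there

    Returns None when automatic containment should be disabled.
    """
    total_pcpus = pcpus_per_node * len(nodes)
    # Small systems (<= 8 pCPUs): statistical multiplexing gain
    # outweighs NUMA placement, so don't contain at all.
    if total_pcpus <= 8:
        return None
    # How many nodes are needed to cover the guest's vCPUs?
    needed = -(-vcpus // pcpus_per_node)  # ceiling division
    if needed > len(nodes):
        return None  # guest can't fit even across all nodes
    # Favour the nodes with the fewest vCPUs already committed.
    by_load = sorted(nodes, key=lambda n: committed.get(n, 0))
    return by_load[:needed]
```

Whether "already-committed vCPUs" or a measured CPU load is the better sort key is exactly the open question in the follow-up mails.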
On 13/11/2009 14:14, "Andre Przywara" <andre.przywara@amd.com> wrote:

> Keir, I will check whether dropping the node containment in the CPU
> overcommitment case is an option, but what would be the right strategy
> in that case?
> Warn the user?
> Don't contain at all?
> Contain to more than one node?

I would suggest that simply not containing at all (i.e., keeping the equivalent of numa=off behaviour) would be safest.

 -- Keir
Ian Pratt wrote:

> In the case where a VM is asking for more vCPUs than there are pCPUs in
> a node, we should contain the guest to multiple nodes. (I presume we
> favour nodes according to the number of vCPUs they already have
> committed to them?)

Seems like CPU load might be a better measure. Xen doesn't calculate load currently, but it's on my list of things to do.

 -George
On 13/11/2009 14:29, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote:

> I'd really like to see us enumerate pCPUs in a sensible order so that it's
> easier to see the topology. It should be nodes.sockets.cores{.threads},
> leaving gaps for missing execution units due to hot plug or
> non-power-of-two packing.
> Right now we're inconsistent in the enumeration order, depending on how
> the BIOS has set things up. It would be great if someone could volunteer
> to fix this...

Even better would be to have pCPUs addressable and listable explicitly as dotted tuples. That can be implemented entirely within the toolstack, and could even allow wildcarding of tuple components to efficiently express cpumasks.

 -- Keir
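The wildcarded-tuple idea can be illustrated with a small toolstack-side sketch. The function name is made up, and it assumes cpuids enumerate in nodes.sockets.cores.threads order -- exactly the sensible order the previous mail asks for, not what a BIOS-dependent enumeration gives today:

```python
def tuple_to_cpumask(spec, shape):
    """Expand a dotted-tuple pCPU spec like '1.0.*.*' into a set of
    linear cpuids.

    spec  -- 'node.socket.core.thread', each field an index or '*'
    shape -- (nodes, sockets_per_node, cores_per_socket,
              threads_per_core), assumed uniform for the sketch
    """
    parts = spec.split('.')
    assert len(parts) == len(shape) == 4
    # Each position contributes either the one given index or all of them.
    choices = [range(dim) if p == '*' else [int(p)]
               for p, dim in zip(parts, shape)]
    mask = set()
    for n in choices[0]:
        for s in choices[1]:
            for c in choices[2]:
                for t in choices[3]:
                    # Linear cpuid under node.socket.core.thread ordering.
                    cpuid = ((n * shape[1] + s) * shape[2] + c) \
                            * shape[3] + t
                    mask.add(cpuid)
    return mask
```

For example, on a 2-node, 2-socket, 2-core, 2-thread box, '1.*.*.*' selects the second node's eight cpuids.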
> Ian Pratt wrote:
>> In the case where a VM is asking for more vCPUs than there are pCPUs
>> in a node, we should contain the guest to multiple nodes. (I presume
>> we favour nodes according to the number of vCPUs they already have
>> committed to them?)
>
> Seems like CPU load might be a better measure. Xen doesn't calculate
> load currently, but it's on my list of things to do.

I'd rather get this stuff fixed now than wait for the new scheduler. It's not clear that instantaneous CPU load is any better than just counting the number of vCPUs. The XCP xapi stack also records good historical data, and would be in a better position to do the placement. Further work.

Ian
>> Keir, I will check whether dropping the node containment in the CPU
>> overcommitment case is an option, but what would be the right strategy
>> in that case?
>> Warn the user?
>> Don't contain at all?
>> Contain to more than one node?
>
> I would suggest that simply not containing at all (i.e., keeping the
> equivalent of numa=off behaviour) would be safest.

I disagree. In systems with 2 nodes it will use all nodes, which is the same as what you propose [*]. In systems with more nodes it will do placement to some subset. Note that systems with >2 nodes generally have stronger NUMA effects, and these are exactly the systems where node placement is a good thing.

[*] Note that numa=off is quite different from just disabling node placement. If node placement is disabled we still get the benefit of memory striping across nodes, which at least avoids some performance cliffs.

Ian
>> I'd really like to see us enumerate pCPUs in a sensible order so that
>> it's easier to see the topology. It should be
>> nodes.sockets.cores{.threads}, leaving gaps for missing execution units
>> due to hot plug or non-power-of-two packing.
>> Right now we're inconsistent in the enumeration order, depending on how
>> the BIOS has set things up. It would be great if someone could
>> volunteer to fix this...
>
> Even better would be to have pCPUs addressable and listable explicitly
> as dotted tuples. That can be implemented entirely within the toolstack,
> and could even allow wildcarding of tuple components to efficiently
> express cpumasks.

Yes, I'd certainly like to see the toolstack support dotted-tuple notation.

However, I just don't trust the toolstack to get this right unless Xen has already set it up nicely for it, with a sensible enumeration and defined sockets-per-node, cores-per-socket and threads-per-core parameters. Xen should provide a clean interface to the toolstack in this respect.

Ian
On 13/11/2009 15:40, "Ian Pratt" <Ian.Pratt@eu.citrix.com> wrote:

>> Even better would be to have pCPUs addressable and listable explicitly
>> as dotted tuples. That can be implemented entirely within the
>> toolstack, and could even allow wildcarding of tuple components to
>> efficiently express cpumasks.
>
> Yes, I'd certainly like to see the toolstack support dotted-tuple
> notation.
>
> However, I just don't trust the toolstack to get this right unless Xen
> has already set it up nicely for it, with a sensible enumeration and
> defined sockets-per-node, cores-per-socket and threads-per-core
> parameters. Xen should provide a clean interface to the toolstack in
> this respect.

Xen provides a topology-interrogation hypercall which should suffice for the tools to build up a {node,socket,core,thread} <-> cpuid mapping table.

 -- Keir
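A toolstack-side sketch of building that mapping table: it assumes the topology interrogation yields one (cpuid, node, socket, core, thread) record per pCPU, which is an assumption about the record layout made purely for illustration, not the hypercall's actual structure.

```python
def build_topology_map(cpu_records):
    """Build both directions of the {node,socket,core,thread} <-> cpuid
    mapping from per-pCPU topology records.

    cpu_records -- iterable of (cpuid, node, socket, core, thread)
                   tuples (hypothetical layout), e.g. as a toolstack
                   might assemble from Xen's topology hypercall.
    """
    fwd = {}   # (node, socket, core, thread) -> cpuid
    rev = {}   # cpuid -> (node, socket, core, thread)
    for cpu, node, socket, core, thread in cpu_records:
        fwd[(node, socket, core, thread)] = cpu
        rev[cpu] = (node, socket, core, thread)
    return fwd, rev
```

With such a table in hand, the dotted-tuple notation and wildcard expansion discussed above reduce to dictionary lookups, regardless of how the BIOS enumerated the cpuids.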
Andre Przywara
2009-Nov-30 15:40 UTC
[Xen-devel] [PATCH] tools: avoid over-commitment if numa=on
Jan Beulich wrote:

>>>> Andre Przywara <andre.przywara@amd.com> 09.11.09 16:02 >>>
>> BTW: Shouldn't we finally set numa=on as the default value?
>
> I'd say no, at least until the default confinement of a guest to a single
> node gets fixed to properly deal with guests having more vCPUs than
> a node's worth of pCPUs (i.e. I take it for granted that the benefits of
> not overcommitting CPUs outweigh the drawbacks of cross-node memory
> accesses, at the very least for CPU-bound workloads).

That sounds reasonable. Attached is a patch that lifts the restriction of one node per guest if the number of VCPUs is greater than the number of cores per node. This isn't optimal (the best way would be to inform the guest about it, but that is another patchset ;-), but it should address the above concerns.

Please apply,
Andre.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

--
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany