Andre Przywara
2010-Jan-12  16:30 UTC
[Xen-devel] [PATCH] numa: fix problems with memory-less nodes
Hi,
If we decided to not report memory-less nodes in physinfo we should also 
skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently 
Xen will not start guests on machines with memory-less nodes which are 
not the last ones. On an 8-node machine with empty nodes 4 and 5 "xm 
info" was reporting wrongly, also the node assignment algorithm crashed 
with a division by zero error.
The attached patch fixes this by skipping empty nodes in the enumeration 
of resources.
Regards,
Andre.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany
Tel: +49 351 448 3567 12
----to satisfy European Law for business letters:
Advanced Micro Devices GmbH
Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen
Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni
Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
Keir Fraser
2010-Jan-13  08:26 UTC
[Xen-devel] Re: [PATCH] numa: fix problems with memory-less nodes
On 12/01/2010 16:30, "Andre Przywara" <andre.przywara@amd.com> wrote:> If we decided to not report memory-less nodes in physinfo we should also > skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently > Xen will not start guests on machines with memory-less nodes which are > not the last ones. On an 8-node machine with empty nodes 4 and 5 "xm > info" was reporting wrongly, also the node assignment algorithm crashed > with a division by zero error. > The attached patch fixes this by skipping empty nodes in the enumeration > of resources.Where to begin? Firstly, I thought that the ordering of nodes in the node_to_* lists actually mattered -- the lists are indexed by nodeid (a handle which can be passed to other Xen interfaces) are they not? If you don''t include empty entries, then the index position of entries is no longer meaningful. Secondly, you avoid appending to the node_to_cpu list if the node is cpu-less. But you avoid appending to the node_to_{memory,dma32} lists only if the node is *both* cpu-less and memory-less. That''s not even consistent. Please just fix the crap Python code. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Andre Przywara
2010-Jan-13  09:42 UTC
[Xen-devel] Re: [PATCH] numa: fix problems with memory-less nodes
Keir Fraser wrote:> On 12/01/2010 16:30, "Andre Przywara" <andre.przywara@amd.com> wrote: > >> If we decided to not report memory-less nodes in physinfo we should also >> skip them in the node_to_{cpu,memory,dma32_mem} Python lists. Currently >> Xen will not start guests on machines with memory-less nodes which are >> not the last ones. On an 8-node machine with empty nodes 4 and 5 "xm >> info" was reporting wrongly, also the node assignment algorithm crashed >> with a division by zero error. >> The attached patch fixes this by skipping empty nodes in the enumeration >> of resources. > > Where to begin? Firstly, I thought that the ordering of nodes in the > node_to_* lists actually mattered -- the lists are indexed by nodeid (a > handle which can be passed to other Xen interfaces) are they not? If you > don''t include empty entries, then the index position of entries is no longer > meaningful.OK, that seems to be an issue. To be honest I am not a fan of omitting nodes from physinfo, but that is what the current code (RC1!) does and it definitely breaks Xen on my box. So I just made this small patch to make it work again. Actually I would opt to revert the patch cropping the number of nodes reported by physinfo (20762:a1d0a575b4ba ?). Yes, that would result in nodes reported with zero memory, but in my tests this did not raise problems, as a node''s memory can (and will) be exhausted even during normal operation. To illustrate the problem: My box has 8 nodes, I removed the memory from nodes 4 & 5. With the unpatched version xm info says: total_memory : 73712 free_memory : 70865 node_to_cpu : node0:0-5,24-35 node1:6-11 node2:12-17 node3:18-23 node4:no cpus node5:no cpus node_to_memory : node0:14267 node1:8167 node2:16335 node3:8167 node4:0 node5:0 So this listing completely omits the last two nodes (CPUs 36-47 and the 24 GB connected to them). The debug key triggered Xen-internal listing is correct, though: (XEN) idx0 -> NODE0 start->0 size->4423680 (XEN) phys_to_nid(0000000000001000) -> 0 should be 0 (XEN) idx1 -> NODE1 start->4423680 size->2097152 (XEN) phys_to_nid(0000000438001000) -> 1 should be 1 (XEN) idx2 -> NODE2 start->6520832 size->4194304 (XEN) phys_to_nid(0000000638001000) -> 2 should be 2 (XEN) idx3 -> NODE3 start->10715136 size->2097152 (XEN) phys_to_nid(0000000a38001000) -> 3 should be 3 (XEN) idx6 -> NODE6 start->12812288 size->4194304 (XEN) phys_to_nid(0000000c38001000) -> 6 should be 6 (XEN) idx7 -> NODE7 start->17006592 size->2097152 (XEN) phys_to_nid(0000001038001000) -> 7 should be 7 With the patched xc.so xm info reports: node_to_cpu : node0:0-5,24-35 node1:6-11 node2:12-17 node3:18-23 node4:36-41 node5:42-47 node_to_memory : node0:14267 node1:8167 node2:16335 node3:8167 node4:16335 node5:7590 Although memory less nodes are not very common, it could happen sometimes with our new dual-node processor, where one could (even accidentally) forget to populate certain memory slots, as it has in fact a dual-node dual-channel memory interface.> Secondly, you avoid appending to the node_to_cpu list if the node is > cpu-less. But you avoid appending to the node_to_{memory,dma32} lists only > if the node is *both* cpu-less and memory-less. That''s not even consistent.OK, that''s a point. I see that the value of node_exists can change.> Please just fix the crap Python code.What part do you exactly mean? The part triggering the division by zero? I will see if I can fix this properly. Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany Tel: +49 351 448 3567 12 ----to satisfy European Law for business letters: Advanced Micro Devices GmbH Karl-Hammerschmidt-Str. 34, 85609 Dornach b. Muenchen Geschaeftsfuehrer: Andrew Bowd; Thomas M. McCoy; Giuliano Meroni Sitz: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2010-Jan-13  09:55 UTC
[Xen-devel] Re: [PATCH] numa: fix problems with memory-less nodes
On 13/01/2010 09:42, "Andre Przywara" <andre.przywara@amd.com> wrote:> To be honest I am not a fan of omitting nodes from physinfo, but that is > what the current code (RC1!) does and it definitely breaks Xen on my > box. So I just made this small patch to make it work again. > Actually I would opt to revert the patch cropping the number of nodes > reported by physinfo (20762:a1d0a575b4ba ?). Yes, that would result in > nodes reported with zero memory, but in my tests this did not raise > problems, as a node''s memory can (and will) be exhausted even during > normal operation.The intention of 20762 was not to change the semantics of the node_to_* lists. It''s simply supposed to make available max_node_id to the toolstack (since this can differ from nr_nodes if there are holes in the online node map). If the node_to_* semantics really have been changed by 20762, then it is a bug. It''s not a bug I can eyeball however. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Keir Fraser
2010-Jan-13  10:02 UTC
[Xen-devel] Re: [PATCH] numa: fix problems with memory-less nodes
On 13/01/2010 09:42, "Andre Przywara" <andre.przywara@amd.com> wrote:> node_to_cpu : node0:0-5,24-35 > node1:6-11 > node2:12-17 > node3:18-23 > node4:no cpus > node5:no cpus > node_to_memory : node0:14267 > node1:8167 > node2:16335 > node3:8167 > node4:0 > node5:0To be clear: I certainly agree that node_to_cpu[4,5] should list the CPUs that are present, even if the nodes are memory-less. But you tell me how 20762 has changed this behaviour. If it was changed by one of Intel''s slightly earlier changesets (which is possible) I''m happy to see it changed back. The new node_to_cpu[] behaviour you are observing is obviously stupid, and I assume unintended. -- Keir _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel