Hello,

XenServer have recently acquired a quad-socket AMD Interlagos server. I have been playing around with it and discovered a logical error in how Xen detects NUMA nodes.

The server has 8 NUMA nodes, 4 of which have memory attached (the even nodes - see SRAT.dsl attached). This means that node_set_online(nodeid) gets called for each node with memory attached. Later, in srat_detect_node(), node gets set to 0 if it was NUMA_NO_NODE, or if not node_online(). This leads to all the processors on the odd nodes being assigned to node 0, even though the odd nodes are present (see interlagos-xl-info-n.log).

I present an RFC patch which changes srat_detect_node() to call node_set_online() for each node, which appears to fix the logic.

Is this a sensible place to set the node online, or is there a better way to fix this logic?

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
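To make the failure mode concrete, here is a small standalone C model of the logic described above. It is not the hypervisor code itself: NR_NODES, the online map array and the one-CPU-per-node assumption are invented for the illustration, while NUMA_NO_NODE, the node_online() check and the node_set_online() call in the RFC fix mirror the names used in the mail.

    /*
     * Standalone model of the node assignment described above, for a box
     * with 8 nodes where only the even nodes have memory attached.
     */
    #include <stdio.h>
    #include <stdbool.h>

    #define NR_NODES     8
    #define NUMA_NO_NODE 0xffu

    static bool node_online_map[NR_NODES];

    /* SRAT memory affinity parsing only onlines nodes that have memory. */
    static void parse_srat_memory(void)
    {
        for (unsigned int node = 0; node < NR_NODES; node += 2)
            node_online_map[node] = true;   /* even nodes have DIMMs */
    }

    /* Simplified per-CPU node detection, with and without the RFC change. */
    static unsigned int srat_detect_node(unsigned int srat_node, bool rfc_fix)
    {
        unsigned int node = srat_node;

        if (rfc_fix && node != NUMA_NO_NODE)
            node_online_map[node] = true;   /* the RFC change: online the node */

        if (node == NUMA_NO_NODE || !node_online_map[node])
            node = 0;                       /* fallback that loses the odd nodes */

        return node;
    }

    int main(void)
    {
        parse_srat_memory();

        /* One CPU per node for brevity; SRAT says CPU n lives on node n. */
        for (unsigned int cpu = 0; cpu < NR_NODES; cpu++) {
            unsigned int before = srat_detect_node(cpu, false);
            unsigned int after  = srat_detect_node(cpu, true);

            printf("CPU %u: SRAT node %u -> node %u without the fix, node %u with it\n",
                   cpu, cpu, before, after);
        }

        return 0;
    }

Running this prints node 0 for every odd-node CPU in the "without" column and the correct node in the "with" column, which is the behaviour reported in interlagos-xl-info-n.log.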
>>> On 27.06.12 at 21:10, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> XenServer have recently acquired a quad-socket AMD Interlagos server.
> I have been playing around with it and discovered a logical error in
> how Xen detects NUMA nodes.
>
> The server has 8 NUMA nodes, 4 of which have memory attached (the even
> nodes - see SRAT.dsl attached). This means that node_set_online(nodeid)
> gets called for each node with memory attached. Later, in
> srat_detect_node(), node gets set to 0 if it was NUMA_NO_NODE, or if not
> node_online(). This leads to all the processors on the odd nodes being
> assigned to node 0, even though the odd nodes are present (see
> interlagos-xl-info-n.log).
>
> I present an RFC patch which changes srat_detect_node() to call
> node_set_online() for each node, which appears to fix the logic.
>
> Is this a sensible place to set the node online, or is there a better
> way to fix this logic?

While the place looks sensible, it has the possible problem of potentially adding bits to the online map pretty late in the game.

As the memory-related invocations of node_set_online() come out of numa_initmem_init()/acpi_scan_nodes(), perhaps the (boot time) CPU-related ones should be done there too (I'd still keep the adjustment you're already doing, to also cover hotplug CPUs)?

Jan
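A rough sketch of what that suggestion might look like: a loop over the APIC-to-node table that the SRAT processor affinity entries populate, run from acpi_scan_nodes() once the memory affinity entries have onlined the nodes that own RAM. The helper name is invented, and apicid_to_node[] / MAX_LOCAL_APIC are assumed to be available as in the Linux-derived srat.c; treat this as a sketch of the idea rather than an actual patch.

    /*
     * Hypothetical helper, called from acpi_scan_nodes() after the memory
     * affinity entries have been processed: also online every node that the
     * SRAT lists a CPU on, so the online map is complete before
     * srat_detect_node() runs for the boot-time CPUs.
     */
    static void __init srat_online_cpu_nodes(void)
    {
        unsigned int i;

        for ( i = 0; i < MAX_LOCAL_APIC; i++ )
        {
            unsigned int node = apicid_to_node[i];

            if ( node != NUMA_NO_NODE )
                node_set_online(node);
        }
    }

With something along these lines at boot, the per-CPU adjustment in srat_detect_node() would only matter for CPUs hotplugged later, as Jan notes.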
On 28/06/12 10:51, Jan Beulich wrote:
>>>> On 27.06.12 at 21:10, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> XenServer have recently acquired a quad-socket AMD Interlagos server.
>> I have been playing around with it and discovered a logical error in
>> how Xen detects NUMA nodes.
>>
>> The server has 8 NUMA nodes, 4 of which have memory attached (the even
>> nodes - see SRAT.dsl attached). This means that node_set_online(nodeid)
>> gets called for each node with memory attached. Later, in
>> srat_detect_node(), node gets set to 0 if it was NUMA_NO_NODE, or if not
>> node_online(). This leads to all the processors on the odd nodes being
>> assigned to node 0, even though the odd nodes are present (see
>> interlagos-xl-info-n.log).
>>
>> I present an RFC patch which changes srat_detect_node() to call
>> node_set_online() for each node, which appears to fix the logic.
>>
>> Is this a sensible place to set the node online, or is there a better
>> way to fix this logic?
> While the place looks sensible, it has the possible problem of
> potentially adding bits to the online map pretty late in the game.
>
> As the memory-related invocations of node_set_online() come out of
> numa_initmem_init()/acpi_scan_nodes(), perhaps the (boot time)
> CPU-related ones should be done there too (I'd still keep the
> adjustment you're already doing, to also cover hotplug CPUs)?
>
> Jan

I have been doing quite a bit more testing this morning, and have come to some sad conclusions.

This specific server is a Dell R815 loaner with 8x4GiB DIMMs, 2 DIMMs hanging off each socket. As each socket is an Interlagos processor, there are 4 memory controllers (with 8 DIMM slots, as they are dual channel).

What this means is that per socket, one node has half of its available DIMMs filled and the other node has no memory. The performance implications are severe, but as it appears that almost all combinations of RAM you can select on the Dell website will lead to poor or worse performance, I can foresee many systems like this in the future. (I don't wish to single Dell out here, other than that it happens to be the provider of the server I am testing. Other server vendors suffer the same issues.)

As to the problem at hand, I will investigate the NUMA code some more and see about setting it up earlier.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
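For reference, if the usual numbering applies (nodes 2n and 2n+1 sharing socket n - an assumption here, not something taken from the attached logs), the population described above works out to:

    socket 0: node 0 - 2 x 4GiB (8GiB), node 1 - no DIMMs
    socket 1: node 2 - 2 x 4GiB (8GiB), node 3 - no DIMMs
    socket 2: node 4 - 2 x 4GiB (8GiB), node 5 - no DIMMs
    socket 3: node 6 - 2 x 4GiB (8GiB), node 7 - no DIMMs

That is, every other node is memoryless, which is exactly the situation that trips up the node detection discussed earlier in the thread.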
On Thu, Jun 28, 2012 at 11:29 AM, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> What this means is that per socket, one node has half of its available
> DIMMs filled and the other node has no memory. The performance
> implications are severe, but as it appears that almost all combinations
> of RAM you can select on the Dell website will lead to poor or worse
> performance, I can foresee many systems like this in the future. (I
> don't wish to single Dell out here, other than that it happens to be the
> provider of the server I am testing. Other server vendors suffer the
> same issues.)

Wow, that sounds almost more like AMP (asymmetric multiprocessing) than NUMA...

 -George