Hello,

I'm posting this as the follow-up to the private mail discussion I already had with Sunay and Roamer.

I'll start with a description of what we are doing. We are working with 2-node cluster configurations on large SPARC servers, e.g. SunFire 6900 clusters. We are now moving to Maramba and M9000 clusters as the next hardware generation for our application. Currently we have an M9000 cluster in our lab for prototyping, equipped with 32 SPARC64-VI "Olympus" CPUs (2 cores x 2 threads) per node, i.e. 128 ways per node. We currently use 4 physical 1 GBit links as the cluster interconnect provided by Sun Cluster; later on we plan to move to 2 links with 10 GBit each. The driver used for these NICs is nxge. The throughput across all links of the cluster interconnect is roughly 100,000 pkts/sec per direction. Most of this load is generated by 8 processes with 1 connection each.

In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup, the interrupt load on the 4 CPUs handling the interrupts for the 4 NICs became a bottleneck. We therefore fenced those 4 cpus into a processor set (actually we fenced both strands of each core, since the strands share the interrupt logic), but the cpus are still almost saturated with interrupt processing (although we already tuned some parameters with ndd). Interrupt fencing has only been a work-around for the prototyping, though. In a real-world setup with our high HA requirements we won't be able to implement interrupt fencing with affordable effort (think of error situations where one cpu fails, interrupts move to different cpus, and we have to re-bind all our processes, create new processor sets, and so on).

So what we need is interrupt fanout to many cpus to avoid the need for interrupt fencing. What makes the situation more difficult is that we run many processes with rather large heaps on the system. In order to keep cache misses down and be able to scale on such a large 128-way server, we explicitly bind the application threads of these processes to *all* 128 virtual cpus in the system. Threads bound to cpus that are doing heavy interrupt processing will starve. We therefore need interrupt processing spread out over enough cpus that even threads bound to interrupt cpus get sufficient cpu time.

What I've learned so far is that the e1000g driver is not able to fan out interrupts to different cpus, but nxge is. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver. We are currently re-installing our cluster with such a setup.

I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow are available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable. We need some kind of work-around using fanout right now to evaluate whether this approach suits our needs, and we need a product solution by the end of this year, since our customers are already waiting for the new server generation.

Since we have 128 virtual cpus in the system, I believe we need a fanout over at least 16 or 32 cpus in order to get the interrupt processing load down enough to still be able to bind threads to these cpus.

After the discussion with Roamer and Sunay, and after reading your document "Hardware Resources Management and Virtualization", I see the following problems for our setup:

- Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 cpus, we would need at least 32 (probably better 64 or more) connections. Sunay already pointed out that even now 3 or 4 cpus are used to move a packet through the stack (interrupt/polling, soft ring, squeue). However, from our tests so far I believe that interrupt processing is what is hurting us most. Setting ip_squeue_fanout and ip_soft_rings_cnt didn't help us (a sketch of these tunables follows below), so I expect we won't get by with 8 connections. I'm currently in discussion with our platform developers on how best to increase the number of connections.

- While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic: it's all inter-node traffic, always coming from the other node of the cluster! This means that no matter how many connections we use, they will always be mapped to the same cpu, since all have the same source IP address. (Since we will use 2 or 4 NICs as the interconnect for reasons of redundancy, they will be mapped to 2 or 4 cpus -- but not to 16 or 32.) We will need some kind of solution for this! Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be some kind of solution for problem #2.

Thanks for taking the time to read all this, and also thanks for your support so far!

Nick.
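For reference, the fanout-related tunables mentioned above are typically set in /etc/system. This is only a minimal sketch: ddi_msix_alloc_limit = 8 is the value discussed later in this thread, while the ip: module prefix and the example values for the squeue/soft-ring variables are assumptions and may differ between releases.

    * /etc/system (takes effect after a reboot)
    * Allow up to 8 MSI/MSI-X vectors per device instance:
    set ddi_msix_alloc_limit = 8
    * Fan incoming connections out across more squeues / soft rings:
    set ip:ip_squeue_fanout = 1
    set ip:ip_soft_rings_cnt = 16

The value currently in effect can be checked on a running system with something like: echo "ddi_msix_alloc_limit/D" | mdb -k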
Nicolas Michael wrote:
> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic [...] Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Nick,

By default, the nxge driver performs hashing on the 5-tuple of the incoming packets. The hash determines which RX ring the packet will land on. Provided that the RX ring is bound to a specific CPU by way of an MSI/MSI-X interrupt, you have a good chance of the incoming packets being distributed among the available CPUs. The Neptune/NIU hardware has 16 RX DMA channels, so technically the incoming packets could be distributed to up to 16 CPUs.

L3/L4 classification would give you explicit control of the connection -> RX ring mapping. L3/L4 classification is planned to be controlled/configured by Crossbow, and I believe it depends on the Crossbow feature delivery phases.

Regards,
Matheos
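To see what that 5-tuple hashing buys in the same-source-IP scenario above, here is a toy sketch in C, written for this thread: the hash function, addresses, and port numbers are made up, and the real Neptune hardware uses its own hash -- only the idea of spreading connections across the 16 RX DMA channels matches what Matheos describes.

    /* toy_5tuple_hash.c -- illustration only, not the nxge driver code */
    #include <stdio.h>
    #include <stdint.h>

    #define NRXRINGS 16   /* Neptune/NIU exposes 16 RX DMA channels */

    /* Toy hash over the 5-tuple; real hardware uses its own polynomial. */
    static unsigned int
    rx_ring_for(uint32_t saddr, uint32_t daddr, uint16_t sport,
        uint16_t dport, uint8_t proto)
    {
            uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport) ^ proto;
            h ^= h >> 16;
            h ^= h >> 8;
            return (h % NRXRINGS);
    }

    int
    main(void)
    {
            uint32_t src = 0x0a000001;   /* 10.0.0.1: the other cluster node */
            uint32_t dst = 0x0a000002;   /* 10.0.0.2: this node */
            uint16_t sport;

            /* Same src/dst IP for every connection; only the source port varies. */
            for (sport = 40000; sport < 40016; sport++)
                    printf("sport %u -> ring %u\n", (unsigned)sport,
                        rx_ring_for(src, dst, sport, 5001, 6 /* TCP */));
            return (0);
    }

With this toy hash, the 16 source ports land on 16 different rings even though source and destination addresses are identical; the real hash will distribute differently, but the point is that same-source-address traffic is not pinned to a single ring.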
Hi Nick,

Welcome to OpenSolaris!

Nicolas Michael wrote:
> I'll start with a description of what we are doing. We are working with 2-node cluster configurations on large SPARC servers [...] The throughput across all links of the cluster interconnect is roughly 100,000 pkts/sec per direction. Most of this load is generated by 8 processes with 1 connection each.

OK.

> In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup, the interrupt load on the 4 CPUs handling the interrupts for the 4 NICs became a bottleneck [...] In a real-world setup with our high HA requirements we won't be able to implement interrupt fencing with affordable effort.

Sure.

> So what we need is interrupt fanout to many cpus to avoid the need for interrupt fencing. [...] We therefore need interrupt processing spread out over enough cpus that even threads bound to interrupt cpus get sufficient cpu time.

Crossbow will allow you to specify a cpu list, which lets the system figure out how many threads to use for fanout (in addition to the poll thread and worker threads). You can choose to bind these threads to the CPUs specified or leave them unbound. The data path architecture, which details the various threads involved in moving the data, including polling, is described here:
http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
I would also recommend reading some of the overview documents on the Docs page:
http://www.opensolaris.org/os/project/crossbow/Docs
(and any feedback would be very useful).

> What I've learned so far is that the e1000g driver is not able to fan out interrupts to different cpus, but nxge is. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver.

As mentioned before, the problem is both polling and fanout. Current S10 bits and Nevada can do some fanout but no polling. Instead of a per-packet interrupt (or an interrupt every few packets), you want to use more polling under load and then do any fanout.

> I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow are available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable.

For NICs supporting MSI, yes.

> - Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 cpus, we would need at least 32 (probably better 64 or more) connections. [...] However, from our tests so far I believe that interrupt processing is what is hurting us most.

With Crossbow, under load, dynamic polling kicks in pretty efficiently. I just finished some experiments on a T2000 where the system was doing about 250k packets/sec and the ratio of packets processed by interrupt vs. poll was under 0.5%. So you don't have to worry about interrupts that much under load.

> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. [...] Do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Sorry, I was not correct about the source IP. As Matheos pointed out, nxge uses all 5 tuples, so you should be good. Also, with Crossbow, if the H/W is inflexible, we can always do fanout in S/W based on any scheme of your choice.

> We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be some kind of solution for problem #2. Thanks for taking the time to read all this, and also thanks for your support so far!

Sure, I will ask our PM to add you to the beta list, and will keep your scenario in mind as we write the code. The description was very helpful in making people understand the problem in real deployments.

Thanks,
Sunay

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Nicolas Michael wrote:
> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. [...] Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Just to clarify what Matheos & Sunay have already mentioned: I have worked on a TCAM manager for the NXGE, and you can classify flows according to any of the protocol fields mentioned.

Here is an excerpt from the description of a Neptune IPv4 TCAM key (IPv6 is a different matter):

    bits
    111:104  TOS byte.
    103:96   Protocol ID.
    95:64    Either L4 port numbers or SPI.
    63:32    ip_addr_sa  IP source address.
    31:0     ip_addr_da  IP destination address.

Here is a Crossbow flow description:

    typedef struct flow_desc_s {
            flow_mask_t                     fd_mask;
            struct ether_vlan_header        fd_mac;
            char                            fd_ipversion;
            char                            fd_protocol;
            in6_addr_t                      fd_remoteaddr;
            in6_addr_t                      fd_localaddr;
            in_port_t                       fd_remoteport;
            in_port_t                       fd_localport;
            uint32_t                        fd_sap;
            char                            fd_pad[4];      /* 64-bit alignment */
    } flow_desc_t;

As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.

---------------------------------------------------------------
Tom McMillan
Sun Microsystems, Inc.
(858) 526-9278 x55278
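For illustration, with the Crossbow bits such flows would presumably be defined through flow administration rather than by touching the TCAM. A hedged sketch: the flowadm command and attribute names below reflect the Crossbow prototype and may change before delivery, and the link name and port numbers are made up.

    # One flow per well-known service port on the interconnect link;
    # each flow can then be steered and accounted for separately.
    flowadm add-flow -l nxge0 -a transport=tcp,local_port=5001 interconn-5001
    flowadm add-flow -l nxge0 -a transport=tcp,local_port=5002 interconn-5002
    flowadm show-flow -l nxge0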
Tom McMillan wrote:
> Just to clarify what Matheos & Sunay have already mentioned: I have worked on a TCAM manager for the NXGE, and you can classify flows according to any of the protocol fields mentioned. [...]
>
> As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.

Tom,

I don't think that works. The connection attributes related to ports etc. change as connections get re-established or machines reboot. And users shouldn't be interacting with TCAMs and figuring out how to program them, either with Crossbow or without.

Cheers,
Sunay

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Sunay Tripathi wrote:
> Tom McMillan wrote:
>> As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.
>
> I don't think that works. The connection attributes related to ports etc. change as connections get re-established or machines reboot. And users shouldn't be interacting with TCAMs and figuring out how to program them, either with Crossbow or without.

Yep. The feature to use here is the ring group load-balancing policy.

From the administrator side, dladm(1M)'s existing <-P policy> option for setting the outbound load-balancing criteria between members of an aggregation is being both generalized to extend to multiple rings of the same NIC/VNIC, and made applicable to both directions.

From the driver interface side, the driver exposes a mac_set_lb_t entry point (MAC Set Load Balancing) as part of the mac_rx_ring_group_info_t. This is detailed in the Crossbow Hardware Resources Management and Virtualization document:
http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf

Kais.
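For the aggregation case Kais mentions, the existing policy option looks roughly like the sketch below on current Solaris 10 bits, where it only affects how outbound traffic is spread across aggregation members; that is the behavior Crossbow is generalizing to multiple rings and to the receive side. The device names and the key are made up for the example.

    # Create a key-1 aggregation over two nxge ports, spreading outbound
    # traffic by L3 and L4 headers (IP addresses and ports):
    dladm create-aggr -P L3,L4 -d nxge0 -d nxge1 1
    dladm show-aggr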
Hello,

did I get everything right in my last post on Oct 5th? Is this how Crossbow would be working in our setup?

Thanks,
Nick.
Nick,

Yes, I believe so. I think the only thing we will need to try out is whether the small number of connections you have will still produce the desired level of fanout, or whether we need to increase the number of connections.

Cheers,
Sunay

Nicolas Michael wrote:
> did I get everything right in my last post on Oct 5th? Is this how Crossbow would be working in our setup?

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Hi all,

unfortunately it took us quite a while to finish the new setup of our system, but for a couple of days now we have been running tests again. The results are good -- except for one major problem (see below).

For the first tests, we used 4 quad-port nxge 1 GBit NICs with one port of each used for the cluster interconnect, and set ddi_msix_alloc_limit to 8, the largest allowed value (this is the discussed workaround until Crossbow is available). The 16 DMA channels on that card allow a fanout over 4 cpus per port, so with 4 ports we have a fanout over 16 cpus. This works pretty well and solves our current problems with the interrupt load on our 128-way server. (We were also able to reduce the packet load through some further optimizations.)

In the current tests we instead have the target configuration: 2 dual-port nxge 10 GBit NICs with one port of each used, again giving us a fanout over 16 cpus (8 cpus per port). Basically, this configuration works very well too, except for one major problem:

The nxge driver happens to pick cpu 0 as one of the cpus (among 7 others) for its interrupt handlers. Since this cpu always handles the clock interrupt, it is now overloaded with interrupt processing. Although we already put cpu 0 into a processor set (so that nothing but interrupts runs on that cpu), it now reaches 100% sys load (mpstat). From the kstat interrupt statistics I've calculated the per-level interrupt time over a period of 5 minutes:

CPU 0 - Overall:  97.7%
CPU 0 - Level 1:  15.9%
CPU 0 - Level 6:   8.9%
CPU 0 - Level 10: 71.9%

Cpu 0 is already 71.9% busy with the clock interrupt alone. On top of that come some level 1 interrupts (the PCIe-to-PCI bridge driver?) and some level 6 interrupts -- these are our nxge NICs. Since the clock interrupt has a higher priority, nxge performance suffers: we see bad packet latencies on the interconnect when cpu 0 becomes overloaded.

So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? "psradm -i 0" won't help, since that would also affect the clock interrupt (I want to make sure that no other HW interrupts run on the cpu that is handling the clock interrupt). What I need is a way to tell the driver not to use cpu 0 (or better: not to use the cpu that the clock interrupt is using).

Is there a solution for this problem in S10U4?
Will there be a solution in Crossbow?

Thanks a lot,
Nick.
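For reference, per-level percentages like the ones above can be derived from the cpu:N:intrstat kstats. A rough sketch, assuming those kstats expose cumulative level-N-time counters in nanoseconds (names may differ by release); the 300-second interval matches the 5-minute window used above.

    # Take two snapshots of cumulative per-level interrupt time on cpu 0,
    # 300 seconds apart, and turn the deltas into a percentage of that cpu.
    kstat -p cpu:0:intrstat > t0
    sleep 300
    kstat -p cpu:0:intrstat > t1
    paste t0 t1 | awk '/-time/ { pct = ($4 - $2) / (300 * 1e9) * 100;
        if (pct > 0.1) printf("%s %.1f%%\n", $1, pct) }'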
Nicolas Michael wrote:
> The nxge driver happens to pick cpu 0 as one of the cpus (among 7 others) for its interrupt handlers. Since this cpu always handles the clock interrupt, it is now overloaded with interrupt processing. [...]
>
> So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? [...]
>
> Is there a solution for this problem in S10U4?
> Will there be a solution in Crossbow?

Hi Nick,

I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.

-krgopi
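A sketch of what that is expected to look like with the Crossbow bits; the cpus link property is taken from the Crossbow design, but the final syntax may still change, and the cpu list here is made up.

    # Restrict the threads and interrupt(s) for nxge0 to cpus 32-39,
    # keeping them away from cpu 0 where the clock interrupt runs:
    dladm set-linkprop -p cpus=32,33,34,35,36,37,38,39 nxge0
    dladm show-linkprop -p cpus nxge0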
Hi all,

> I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.

What about intrd? It seems this daemon does what we need -- monitoring of the interrupt load and automatic reassignment of interrupts. However, it seems to be part of OpenSolaris only and not yet included in S10? Do you have any plans to integrate it in a future S10 update?

Also, how exactly does intrd work? That is, which interrupt routine(s) would be moved to a different cpu in my example?

CPU 0 - Overall:  97.7%
CPU 0 - Level 1:  15.9%
CPU 0 - Level 6:   8.9%
CPU 0 - Level 10: 71.9%

Is the reassignment deterministic in any way? For example, are lower-priority interrupts always moved to different cpus first (in this example, the level 1 and level 6 interrupts)? For our setup, we would want the clock interrupt to stay on cpu 0.

And again: we urgently need a solution for this problem by early next year, based on Solaris 10. Does anyone have an idea how to avoid interrupt overload in S10 (SPARC only)? Any help appreciated!

Thanks,
Nick.
Rajagopal Kunhappan writes:
> Nicolas Michael wrote:
>> So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? [...]
>
> I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.
>
> -krgopi

With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.

-r
Roch - PAE wrote:
> With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.

Not really. Even when polling mode is implemented, interrupts are still necessary. For rx traffic, the interrupt triggers the upper layer to start polling by disabling this interrupt, and the interrupt is re-enabled when there are no more completed inbound packets available in this rx ring. So re-targeting the interrupt is still important. Furthermore, the polling thread will run on the same CPU that takes the rx traffic interrupt.

Thanks,
Roamer
Yunsong Lu writes:
> Roch - PAE wrote:
>> With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.
>
> Not really. Even when polling mode is implemented, interrupts are still necessary. For rx traffic, the interrupt triggers the upper layer to start polling by disabling this interrupt, and the interrupt is re-enabled when there are no more completed inbound packets available in this rx ring. So re-targeting the interrupt is still important.

It seems the interrupt load will be so much smaller as to not matter. The int level is higher than clock, so it will run in time, but it will be so short as to not perturb clock. Do we have an estimate of how much interrupt CPU is necessary when a NIC is being polled?

> Furthermore, the polling thread will run on the same CPU that takes the rx traffic interrupt.

This seems to hit a use case where binding kernel network threads is undesirable. I would rather allow network thread binding only for benchmark purposes and have the threads unbound by default.

-r
Roch - PAE wrote:
> It seems the interrupt load will be so much smaller as to not matter. The int level is higher than clock, so it will run in time, but it will be so short as to not perturb clock. Do we have an estimate of how much interrupt CPU is necessary when a NIC is being polled?

When a NIC is being polled, no rx traffic interrupts are needed at that moment. Generally, rx interrupts happen rarely when traffic is heavy.

Thanks,
Roamer
Hello,

I have an update regarding our interrupt collision problem between network and clock interrupts: the high load of the clock interrupt turned out to be a "home-made" problem. Someone added

    set hires_tick = 1
    set hires_hz = 1000

to my /etc/system. It wasn't me, and it wasn't our installation (which adds quite a few lines). I have no idea yet where these lines came from, but they were responsible for the high load of the clock interrupt. With these settings removed, we still have clock and nxge on cpu 0, but in sum the interrupt load is only 35-40% on cpu 0, so we're ok now.

Nevertheless, the possibility of cpu overload due to too many interrupt routines of course remains. But if you have solutions for this with dladm and/or intrd in Crossbow and later on in S10U6, that will be enough for us. With our currently largest system we're fine. We expect to have even larger systems (>= 256 ways) a year from now, so interrupt load may become a problem again, but I hope that by then S10U6 with the Crossbow optimizations will be available.

Thanks a lot,
Nick.
Hello,

this thread is more than a year old, but you probably remember the discussion. We've had problems with the cpu load for interrupt processing on our highly loaded NICs. You mentioned that as part of Crossbow you were implementing a polling mechanism that would kick in at high loads and deliver packets through polling rather than by generating interrupts. Has this been implemented in Crossbow already? If yes, has it been backported to Solaris 10 yet?

Thanks a lot,
Nick.
Nicolas Michael wrote:
> this thread is more than a year old, but you probably remember the discussion. We've had problems with the cpu load for interrupt processing on our highly loaded NICs. You mentioned that as part of Crossbow you were implementing a polling mechanism that would kick in at high loads and deliver packets through polling rather than by generating interrupts. Has this been implemented in Crossbow already?

Yes, it's in the current Crossbow bits that are going into OpenSolaris soon. You can already download the bits from OpenSolaris.org now.

> If yes, has it been backported to Solaris 10 yet?

No.

Markus