Hello,

I'm posting this as the follow-up to the private mail discussion I already had with Sunay and Roamer.

I'll start with a description of what we are doing. We are working with 2-node cluster configurations on large SPARC servers, e.g. SunFire 6900 clusters. We are now moving to Maramba and M9000 clusters as the next hardware generation for our application. Currently we have an M9000 cluster in our lab for prototyping, equipped with 32 SPARC64-VI "Olympus" CPUs (2 cores x 2 threads) per node, i.e. 128 ways per node. We currently use 4 physical 1 GBit links as the cluster interconnect provided by Sun Cluster; later on we plan to move to 2 links with 10 GBit each. The driver used for these NICs is nxge. The throughput across all links of the cluster interconnect is roughly 100,000 pkts/sec per direction. Most of this load is generated by 8 processes with 1 connection each.

In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup, the interrupt load on the 4 CPUs handling the interrupts for the 4 NICs became a bottleneck. We therefore fenced those 4 cpus into a processor set (actually we fenced both strands of each core, since the strands share the interrupt logic), but the cpus are still almost saturated with interrupt processing (although we already tuned some parameters with ndd). Interrupt fencing has only been a work-around for the prototyping, though. In a real-world setup with our high HA requirements we won't be able to implement interrupt fencing with affordable effort (think of error situations where one cpu fails, interrupts move to different cpus, and we have to re-bind all our processes, create new processor sets, and so on).

So what we need is interrupt fanout to many cpus to avoid the need for interrupt fencing. What makes the situation more difficult is that we run many processes with rather large heaps on the system. In order to keep cache misses down and be able to scale on such a large 128-way server, we explicitly bind the application threads of these processes to *all* 128 virtual cpus in the system. Threads bound to cpus that are doing heavy interrupt processing will starve. We therefore need interrupt processing spread out over enough cpus that even threads bound to interrupt cpus get sufficient cpu time.

What I've learned so far is that the e1000g driver is not able to fan out interrupts to different cpus, but nxge is. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver. We are currently re-installing our cluster with such a setup.

I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow are available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable. We need some kind of work-around using fanout right now to evaluate whether this approach suits our needs, and we need a product solution by the end of this year, since our customers are already waiting for the new server generation.

Since we have 128 virtual cpus in the system, I believe we need a fanout over at least 16 or 32 cpus in order to get the interrupt processing load down enough to still be able to bind threads to these cpus.

After the discussion with Roamer and Sunay, and after reading your document "Hardware Resources Management and Virtualization", I see the following problems for our setup:

- Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 cpus, we would need at least 32 (probably better 64 or more) connections. Sunay already pointed out that even now 3 or 4 cpus are used to move a packet through the stack (interrupt/polling, soft ring, squeue). However, from our tests so far I believe that interrupt processing is what is hurting us most. Setting ip_squeue_fanout and ip_soft_rings_cnt didn't help us (a sketch of these tunables follows below), so I expect we won't get by with 8 connections. I'm currently in discussion with our platform developers on how best to increase the number of connections.

- While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic: it's all inter-node traffic, always coming from the other node of the cluster! This means that no matter how many connections we use, they will always be mapped to the same cpu, since all have the same source IP address. (Since we will use 2 or 4 NICs as the interconnect for reasons of redundancy, they will be mapped to 2 or 4 cpus -- but not to 16 or 32.) We will need some kind of solution for this! Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be some kind of solution for problem #2.

Thanks for taking the time to read all this, and also thanks for your support so far!

Nick.
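For reference, the fanout-related tunables mentioned above are typically set in /etc/system. This is only a minimal sketch: ddi_msix_alloc_limit = 8 is the value discussed later in this thread, while the ip: module prefix and the example values for the squeue/soft-ring variables are assumptions and may differ between releases.

    * /etc/system (takes effect after a reboot)
    * Allow up to 8 MSI/MSI-X vectors per device instance:
    set ddi_msix_alloc_limit = 8
    * Fan incoming connections out across more squeues / soft rings:
    set ip:ip_squeue_fanout = 1
    set ip:ip_soft_rings_cnt = 16

The value currently in effect can be checked on a running system with something like: echo "ddi_msix_alloc_limit/D" | mdb -k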
Nicolas Michael wrote:
> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. Since we are talking about the traffic on our cluster interconnect, we always have the same source address for *all* our traffic [...] Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Nick,

By default, the nxge driver performs hashing on the 5-tuple of the incoming packets. The hash determines which RX ring the packet will land on. Provided that the RX ring is bound to a specific CPU by way of an MSI/MSI-X interrupt, you have a good chance of the incoming packets being distributed among the available CPUs. The Neptune/NIU hardware has 16 RX DMA channels, so technically the incoming packets could be distributed to up to 16 CPUs.

L3/L4 classification would give you explicit control of the connection -> RX ring mapping. L3/L4 classification is planned to be controlled/configured by Crossbow, and I believe it depends on the Crossbow feature delivery phases.

Regards,
Matheos
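To see what that 5-tuple hashing buys in the same-source-IP scenario above, here is a toy sketch in C, written for this thread: the hash function, addresses, and port numbers are made up, and the real Neptune hardware uses its own hash -- only the idea of spreading connections across the 16 RX DMA channels matches what Matheos describes.

    /* toy_5tuple_hash.c -- illustration only, not the nxge driver code */
    #include <stdio.h>
    #include <stdint.h>

    #define NRXRINGS 16   /* Neptune/NIU exposes 16 RX DMA channels */

    /* Toy hash over the 5-tuple; real hardware uses its own polynomial. */
    static unsigned int
    rx_ring_for(uint32_t saddr, uint32_t daddr, uint16_t sport,
        uint16_t dport, uint8_t proto)
    {
            uint32_t h = saddr ^ daddr ^ (((uint32_t)sport << 16) | dport) ^ proto;
            h ^= h >> 16;
            h ^= h >> 8;
            return (h % NRXRINGS);
    }

    int
    main(void)
    {
            uint32_t src = 0x0a000001;   /* 10.0.0.1: the other cluster node */
            uint32_t dst = 0x0a000002;   /* 10.0.0.2: this node */
            uint16_t sport;

            /* Same src/dst IP for every connection; only the source port varies. */
            for (sport = 40000; sport < 40016; sport++)
                    printf("sport %u -> ring %u\n", (unsigned)sport,
                        rx_ring_for(src, dst, sport, 5001, 6 /* TCP */));
            return (0);
    }

With this toy hash, the 16 source ports land on 16 different rings even though source and destination addresses are identical; the real hash will distribute differently, but the point is that same-source-address traffic is not pinned to a single ring.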
Hi Nick,

Welcome to OpenSolaris!

Nicolas Michael wrote:
> I'll start with a description of what we are doing. We are working with 2-node cluster configurations on large SPARC servers [...] The throughput across all links of the cluster interconnect is roughly 100,000 pkts/sec per direction. Most of this load is generated by 8 processes with 1 connection each.

OK.

> In our first tests we used Intel NICs with the e1000g driver (4 links). In this setup, the interrupt load on the 4 CPUs handling the interrupts for the 4 NICs became a bottleneck [...] In a real-world setup with our high HA requirements we won't be able to implement interrupt fencing with affordable effort.

Sure.

> So what we need is interrupt fanout to many cpus to avoid the need for interrupt fencing. [...] We therefore need interrupt processing spread out over enough cpus that even threads bound to interrupt cpus get sufficient cpu time.

Crossbow will allow you to specify a cpu list, which lets the system figure out how many threads to use for fanout (in addition to the poll thread and worker threads). You can choose to bind these threads to the CPUs specified or leave them unbound. The data path architecture, which details the various threads involved in moving the data, including polling, is described here:
http://www.opensolaris.org/os/project/crossbow/Design_softringset.txt
I would also recommend reading some of the overview documents on the Docs page:
http://www.opensolaris.org/os/project/crossbow/Docs
(and any feedback would be very useful).

> What I've learned so far is that the e1000g driver is not able to fan out interrupts to different cpus, but nxge is. So from now on we will use 1 GBit and 10 GBit NICs with the nxge driver.

As mentioned before, the problem is both polling and fanout. Current S10 bits and Nevada can do some fanout but no polling. Instead of a per-packet interrupt (or an interrupt every few packets), you want to use more polling under load and then do any fanout.

> I've also learned that until the Interrupt Resource Management (IRM) project or Crossbow are available, the level of fanout can be controlled through the ddi_msix_alloc_limit global variable.

For NICs supporting MSI, yes.

> - Fanout to Rx rings is done on a per-connection basis, based on L3/L4 classifiers. Currently we have 8 high-load connections and some more low-load connections. To fan out to 32 cpus, we would need at least 32 (probably better 64 or more) connections. [...] However, from our tests so far I believe that interrupt processing is what is hurting us most.

With Crossbow, under load, dynamic polling kicks in pretty efficiently. I just finished some experiments on a T2000 where the system was doing about 250k packets/sec and the ratio of packets processed by interrupt vs. poll was under 0.5%. So you don't have to worry about interrupts that much under load.

> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. [...] Do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Sorry, I was not correct about the source IP. As Matheos pointed out, nxge uses all 5 tuples, so you should be good. Also, with Crossbow, if the H/W is inflexible, we can always do fanout in S/W based on any scheme of your choice.

> We would also be very interested in taking part in a Crossbow beta test in November. Most important for us in this test would be some kind of solution for problem #2. Thanks for taking the time to read all this, and also thanks for your support so far!

Sure, I will ask our PM to add you to the beta list, and will keep your scenario in mind as we write the code. The description was very helpful in making people understand the problem in real deployments.

Thanks,
Sunay

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Nicolas Michael wrote:
> - While we already have some ideas how to solve problem #1, the next problem really seems to destroy everything again: Sunay wrote that of the possible L3/L4 classifiers, nxge only uses the source IP address as a classifier for fanout. [...] Do you have any ideas what a solution could look like? E.g., do you plan to extend nxge to also consider source & destination port as classifiers for fanout?

Just to clarify what Matheos & Sunay have already mentioned: I have worked on a TCAM manager for the NXGE, and you can classify flows according to any of the protocol fields mentioned.

Here is an excerpt from the description of a Neptune IPv4 TCAM key (IPv6 is a different matter):

    bits
    111:104  TOS byte.
    103:96   Protocol ID.
    95:64    Either L4 port numbers or SPI.
    63:32    ip_addr_sa  IP source address.
    31:0     ip_addr_da  IP destination address.

Here is a Crossbow flow description:

    typedef struct flow_desc_s {
            flow_mask_t                     fd_mask;
            struct ether_vlan_header        fd_mac;
            char                            fd_ipversion;
            char                            fd_protocol;
            in6_addr_t                      fd_remoteaddr;
            in6_addr_t                      fd_localaddr;
            in_port_t                       fd_remoteport;
            in_port_t                       fd_localport;
            uint32_t                        fd_sap;
            char                            fd_pad[4];      /* 64-bit alignment */
    } flow_desc_t;

As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.

---------------------------------------------------------------
Tom McMillan
Sun Microsystems, Inc.
(858) 526-9278 x55278
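For illustration, with the Crossbow bits such flows would presumably be defined through flow administration rather than by touching the TCAM. A hedged sketch: the flowadm command and attribute names below reflect the Crossbow prototype and may change before delivery, and the link name and port numbers are made up.

    # One flow per well-known service port on the interconnect link;
    # each flow can then be steered and accounted for separately.
    flowadm add-flow -l nxge0 -a transport=tcp,local_port=5001 interconn-5001
    flowadm add-flow -l nxge0 -a transport=tcp,local_port=5002 interconn-5002
    flowadm show-flow -l nxge0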
Tom McMillan wrote:
> Just to clarify what Matheos & Sunay have already mentioned: I have worked on a TCAM manager for the NXGE, and you can classify flows according to any of the protocol fields mentioned. [...]
>
> As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.

Tom,

I don't think that works. The connection attributes related to ports etc. change as connections get re-established or machines reboot. And users shouldn't be interacting with TCAMs and figuring out how to program them, either with Crossbow or without.

Cheers,
Sunay

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Sunay Tripathi wrote:
> Tom McMillan wrote:
>> As Matheos mentioned previously, the NXGE has 16 receive channels available. So if you programmed the TCAM accordingly (you can't program it directly, but you can certainly use Crossbow to do so), you could define 16 different flows.
>
> I don't think that works. The connection attributes related to ports etc. change as connections get re-established or machines reboot. And users shouldn't be interacting with TCAMs and figuring out how to program them, either with Crossbow or without.

Yep. The feature to use here is the ring group load-balancing policy.

From the administrator side, dladm(1M)'s existing <-P policy> option for setting the outbound load-balancing criteria between members of an aggregation is being both generalized to extend to multiple rings of the same NIC/VNIC, and made applicable to both directions.

From the driver interface side, the driver exposes a mac_set_lb_t entry point (MAC Set Load Balancing) as part of the mac_rx_ring_group_info_t. This is detailed in the Crossbow Hardware Resources Management and Virtualization document:
http://www.opensolaris.org/os/project/crossbow/Docs/virtual_resources.pdf

Kais.
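For the aggregation case Kais mentions, the existing policy option looks roughly like the sketch below on current Solaris 10 bits, where it only affects how outbound traffic is spread across aggregation members; that is the behavior Crossbow is generalizing to multiple rings and to the receive side. The device names and the key are made up for the example.

    # Create a key-1 aggregation over two nxge ports, spreading outbound
    # traffic by L3 and L4 headers (IP addresses and ports):
    dladm create-aggr -P L3,L4 -d nxge0 -d nxge1 1
    dladm show-aggr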
Hello,

did I get everything right in my last post on Oct 5th? Is this how Crossbow would be working in our setup?

Thanks,
Nick.
Nick,

Yes, I believe so. I think the only thing we will need to try out is whether the small number of connections you have will still produce the desired level of fanout, or whether we need to increase the number of connections.

Cheers,
Sunay

Nicolas Michael wrote:
> did I get everything right in my last post on Oct 5th? Is this how Crossbow would be working in our setup?

--
Sunay Tripathi
Distinguished Engineer
Solaris Core Operating System
Sun MicroSystems Inc.

Solaris Networking: http://www.opensolaris.org/os/community/networking
Project Crossbow:   http://www.opensolaris.org/os/project/crossbow
Hi all,

unfortunately it took us quite a while to finish the new setup of our system, but for a couple of days now we have been running tests again. The results are good -- except for one major problem (see below).

For the first tests, we used 4 quad-port nxge 1 GBit NICs with one port of each used for the cluster interconnect, and set ddi_msix_alloc_limit to 8, the largest allowed value (this is the discussed workaround until Crossbow is available). The 16 DMA channels on that card allow a fanout over 4 cpus per port, so with 4 ports we have a fanout over 16 cpus. This works pretty well and solves our current problems with the interrupt load on our 128-way server. (We were also able to reduce the packet load through some further optimizations.)

In the current tests we instead have the target configuration: 2 dual-port nxge 10 GBit NICs with one port of each used, again giving us a fanout over 16 cpus (8 cpus per port). Basically, this configuration works very well too, except for one major problem:

The nxge driver happens to pick cpu 0 as one of the cpus (among 7 others) for its interrupt handlers. Since this cpu always handles the clock interrupt, it is now overloaded with interrupt processing. Although we already put cpu 0 into a processor set (so that nothing but interrupts runs on that cpu), it now reaches 100% sys load (mpstat). From the kstat interrupt statistics I've calculated the per-level interrupt time over a period of 5 minutes:

CPU 0 - Overall:  97.7%
CPU 0 - Level 1:  15.9%
CPU 0 - Level 6:   8.9%
CPU 0 - Level 10: 71.9%

Cpu 0 is already 71.9% busy with the clock interrupt alone. On top of that come some level 1 interrupts (the PCIe-to-PCI bridge driver?) and some level 6 interrupts -- these are our nxge NICs. Since the clock interrupt has a higher priority, nxge performance suffers: we see bad packet latencies on the interconnect when cpu 0 becomes overloaded.

So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? "psradm -i 0" won't help, since that would also affect the clock interrupt (I want to make sure that no other HW interrupts run on the cpu that is handling the clock interrupt). What I need is a way to tell the driver not to use cpu 0 (or better: not to use the cpu that the clock interrupt is using).

Is there a solution for this problem in S10U4?
Will there be a solution in Crossbow?

Thanks a lot,
Nick.
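For reference, per-level percentages like the ones above can be derived from the cpu:N:intrstat kstats. A rough sketch, assuming those kstats expose cumulative level-N-time counters in nanoseconds (names may differ by release); the 300-second interval matches the 5-minute window used above.

    # Take two snapshots of cumulative per-level interrupt time on cpu 0,
    # 300 seconds apart, and turn the deltas into a percentage of that cpu.
    kstat -p cpu:0:intrstat > t0
    sleep 300
    kstat -p cpu:0:intrstat > t1
    paste t0 t1 | awk '/-time/ { pct = ($4 - $2) / (300 * 1e9) * 100;
        if (pct > 0.1) printf("%s %.1f%%\n", $1, pct) }'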
Nicolas Michael wrote:
> The nxge driver happens to pick cpu 0 as one of the cpus (among 7 others) for its interrupt handlers. Since this cpu always handles the clock interrupt, it is now overloaded with interrupt processing. [...]
>
> So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? [...]
>
> Is there a solution for this problem in S10U4?
> Will there be a solution in Crossbow?

Hi Nick,

I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.

-krgopi
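A sketch of what that is expected to look like with the Crossbow bits; the cpus link property is taken from the Crossbow design, but the final syntax may still change, and the cpu list here is made up.

    # Restrict the threads and interrupt(s) for nxge0 to cpus 32-39,
    # keeping them away from cpu 0 where the clock interrupt runs:
    dladm set-linkprop -p cpus=32,33,34,35,36,37,38,39 nxge0
    dladm show-linkprop -p cpus nxge0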
Hi all,

> I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.

What about intrd? It seems this daemon does what we need -- monitoring of the interrupt load and automatic reassignment of interrupts. However, it seems to be part of OpenSolaris only and not yet included in S10? Do you have any plans to integrate it in a future S10 update?

Also, how exactly does intrd work? That is, which interrupt routine(s) would be moved to a different cpu in my example?

CPU 0 - Overall:  97.7%
CPU 0 - Level 1:  15.9%
CPU 0 - Level 6:   8.9%
CPU 0 - Level 10: 71.9%

Is the reassignment deterministic in any way? For example, are lower-priority interrupts always moved to different cpus first (in this example, the level 1 and level 6 interrupts)? For our setup, we would want the clock interrupt to stay on cpu 0.

And again: we urgently need a solution for this problem by early next year, based on Solaris 10. Does anyone have an idea how to avoid interrupt overload in S10 (SPARC only)? Any help appreciated!

Thanks,
Nick.
Rajagopal Kunhappan writes:
> Nicolas Michael wrote:
>> So my question is: how can I prevent nxge (and the PCIe-to-PCI bridge) from choosing cpu 0 for their interrupts? [...]
>
> I don't know if there is a way of doing this in S10U4, but with Crossbow this will be possible. In Crossbow you will be able to use the dladm command to specifically re-target the interrupt(s) from a NIC to the CPU(s) of your choice.
>
> -krgopi

With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.

-r
Roch - PAE wrote:
> With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.

Not really. Even when polling mode is implemented, interrupts are still necessary. For rx traffic, the interrupt triggers the upper layer to start polling by disabling this interrupt, and the interrupt is re-enabled when there are no more completed inbound packets available in this rx ring. So re-targeting the interrupt is still important. Furthermore, the polling thread will run on the same CPU that takes the rx traffic interrupt.

Thanks,
Roamer
Yunsong Lu writes:
> Roch - PAE wrote:
>> With Crossbow the problem should go away by itself, since the NIC should be placed into polling mode anyway.
>
> Not really. Even when polling mode is implemented, interrupts are still necessary. For rx traffic, the interrupt triggers the upper layer to start polling by disabling this interrupt, and the interrupt is re-enabled when there are no more completed inbound packets available in this rx ring. So re-targeting the interrupt is still important.

It seems the interrupt load will be so much smaller as to not matter. The int level is higher than clock, so it will run in time, but it will be so short as to not perturb clock. Do we have an estimate of how much interrupt CPU is necessary when a NIC is being polled?

> Furthermore, the polling thread will run on the same CPU that takes the rx traffic interrupt.

This seems to hit a use case where binding kernel network threads is undesirable. I would rather allow network thread binding only for benchmark purposes and have the threads unbound by default.

-r
Roch - PAE wrote:
> It seems the interrupt load will be so much smaller as to not matter. The int level is higher than clock, so it will run in time, but it will be so short as to not perturb clock. Do we have an estimate of how much interrupt CPU is necessary when a NIC is being polled?

When a NIC is being polled, no rx traffic interrupts are needed at that moment. Generally, rx interrupts happen rarely when traffic is heavy.

Thanks,
Roamer
Hello,

I have an update regarding our interrupt collision problem between network and clock interrupts: the high load of the clock interrupt turned out to be a "home-made" problem. Someone added

    set hires_tick = 1
    set hires_hz = 1000

to my /etc/system. It wasn't me, and it wasn't our installation (which adds quite a few lines). I have no idea yet where these lines came from, but they were responsible for the high load of the clock interrupt. With these settings removed, we still have clock and nxge on cpu 0, but in sum the interrupt load is only 35-40% on cpu 0, so we're ok now.

Nevertheless, the possibility of cpu overload due to too many interrupt routines of course remains. But if you have solutions for this with dladm and/or intrd in Crossbow and later on in S10U6, that will be enough for us. With our currently largest system we're fine. We expect to have even larger systems (>= 256 ways) a year from now, so interrupt load may become a problem again, but I hope that by then S10U6 with the Crossbow optimizations will be available.

Thanks a lot,
Nick.
Hello,

this thread is more than a year old, but you probably remember the discussion. We've had problems with the cpu load for interrupt processing on our highly loaded NICs. You mentioned that as part of Crossbow you were implementing a polling mechanism that would kick in at high loads and deliver packets through polling rather than by generating interrupts. Has this been implemented in Crossbow already? If yes, has it been backported to Solaris 10 yet?

Thanks a lot,
Nick.
Nicolas Michael wrote:
> this thread is more than a year old, but you probably remember the discussion. We've had problems with the cpu load for interrupt processing on our highly loaded NICs. You mentioned that as part of Crossbow you were implementing a polling mechanism that would kick in at high loads and deliver packets through polling rather than by generating interrupts. Has this been implemented in Crossbow already?

Yes, it's in the current Crossbow bits that are going into OpenSolaris soon. You can already download the bits from OpenSolaris.org now.

> If yes, has it been backported to Solaris 10 yet?

No.

Markus