Andrew Gallatin
2009-May-13 13:12 UTC
[crossbow-discuss] transmit side hashing for scaling?
Hi,

Can somebody shed some light on how Crossbow hashes outgoing packets to different transmit rings (not ring groups)?

My 10GbE driver has multiple rings (and a single group). Each transmit ring shares an interrupt with a corresponding receive ring. We call the set of one TX ring, one RX ring, and the interrupt handler state a "slice". Transmit completions are handled from the interrupt handler. On OSes that support multiple transmit routes, we've found that ensuring a particular connection is always hashed to the same slice by both the host and the NIC helps quite a bit with performance (it improves CPU locality, reduces cache misses, and decreases power consumption).

Some OSes (like FreeBSD) let a driver assist in tagging a connection so that traffic for the same connection is easily hashed to the same slice in the host and the NIC. Others (Linux, S10) let the driver hash the outgoing packets to provide this locality.

So: where is the transmit hashing done in Crossbow? Is it tunable? Is there a hook where I can provide a hash routine (like Linux)? Can I tag packets (like FreeBSD)? Is it at least something standard like Toeplitz?

Drew
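For readers following along, a minimal sketch of the per-slice state described above; every type and field name here is hypothetical rather than taken from myri10ge or any Solaris interface.

#include <stdint.h>

/*
 * Hypothetical per-slice state: one TX ring, one RX ring, and the
 * interrupt-handler state that services both.  Purely illustrative.
 */
struct tx_ring;			/* driver-private transmit ring */
struct rx_ring;			/* driver-private receive ring */

typedef struct slice {
	struct tx_ring	*sl_txr;	/* transmit descriptor ring */
	struct rx_ring	*sl_rxr;	/* receive descriptor ring */
	int		sl_intr_vec;	/* MSI-X vector shared by both rings */
	int		sl_cpu;		/* CPU the interrupt is bound to */
	uint64_t	sl_tx_done;	/* TX completions reaped in the handler */
} slice_t;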
Andrew Gallatin wrote:
> So: where is the transmit hashing done in Crossbow? Is it tunable?
> Is there a hook where I can provide a hash routine (like Linux)?
> Can I tag packets (like FreeBSD)? Is it at least something standard
> like Toeplitz?

If your driver has advertised multiple tx rings, then look at mac_tx_fanout_mode(), which computes the hash on the fanout hint passed down from ip. Providing hooks for additional hash routines has been suggested.

Nitin
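A conceptual sketch of the fanout Nitin describes: a per-connection hint from IP is hashed to a stable TX ring index. This is not the actual mac_tx_fanout_mode() source; the helper name and the mixing constants are invented for illustration.

#include <stdint.h>

static unsigned int
pick_tx_ring(uintptr_t fanout_hint, unsigned int nrings)
{
	/* Mix the hint so nearby hint values still spread across rings. */
	uint64_t h = (uint64_t)fanout_hint;

	h ^= h >> 16;
	h *= 0x9e3779b97f4a7c15ULL;	/* arbitrary odd multiplier */
	h ^= h >> 32;
	return ((unsigned int)(h % nrings));
}

The important property is that the same connection always yields the same hint, and therefore the same ring, which is the locality Andrew is asking about.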
Andrew Gallatin
2009-May-13 17:27 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nitin Hande wrote:
> If your driver has advertised multiple tx rings, then look at
> mac_tx_fanout_mode(), which computes the hash on the fanout hint
> passed down from ip. Providing hooks for additional hash routines has
> been suggested.

I guess my best bet might be to lie and say I have only one TX ring, then fan things out myself, like I used to before Crossbow. Is there any non-obvious disadvantage to that?

While looking at this, I noticed mac_tx_serializer_mode(). Am I reading it right, in that it serializes a single queue? That seems lacking, compared to the nxge_serialize stuff it replaces.

Drew
rajagopal kunhappan
2009-May-13 18:07 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> I guess my best bet might be to lie and say I have only one TX ring,
> then fan things out myself, like I used to before Crossbow. Is there
> any non-obvious disadvantage to that?

I don't know if this is non-obvious, but exposing Tx rings to the mac layer helps with:

1) Virtualization, e.g. VNICs getting their own Tx rings.
2) Flow control when multiple Tx rings are present (presently done for UDP). If a Tx ring is blocked (out of descriptors), then only the conn_t's sending on that Tx ring get blocked.

We want all the NICs to expose their Tx rings to the MAC layer.

> While looking at this, I noticed mac_tx_serializer_mode(). Am I
> reading it right, in that it serializes a single queue? That seems
> lacking, compared to the nxge_serialize stuff it replaces.

It is a generic solution for NICs that do not have good locking on the Tx side. mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode; it exposes multiple Tx rings. When multiple Tx rings are present, mac_tx_fanout_mode() is used. mac_tx_fanout_mode() can also operate in serialized mode, in which case there is a serializer (soft ring) for each Tx ring. Nxge uses that mode.

-krgopi
Andrew Gallatin wrote:
> I guess my best bet might be to lie and say I have only one TX ring,
> then fan things out myself, like I used to before Crossbow. Is there
> any non-obvious disadvantage to that?

If you advertise a single ring, then the tx path will end up in mac_tx_single_ring_mode(), the way it does for the e1000g driver. I think in that case the entry point into the driver is through the older xxx_m_tx() routine; you may have to pay attention to that in your driver.

There could be a slight variance between the two schemes. In the case of single_ring_mode(), if you get backpressured from the driver on the tx side due to lack of descriptors, then packets will be enqueued at the tx SRS. At that point, if there are multiple threads trying to send additional packets, all of the packets end up getting queued, whereas there is only one worker thread trying to clear the queue build-up. At high packet rates it is difficult for this one thread to catch up. (Also look at the MAC_DROP_ON_NO_DESC flag in mac_tx_srs_no_desc(), which drops the packets rather than queuing them.) In mac_tx_fanout_mode(), by contrast, each tx ring gets its own soft ring, and its own worker thread, in case of backpressure.

> While looking at this, I noticed mac_tx_serializer_mode(). Am I
> reading it right, in that it serializes a single queue? That seems
> lacking, compared to the nxge_serialize stuff it replaces.

Yes. This part was done for nxge, and as far as I remember the recent performance of this scheme was very close to that of the previous scheme. I think Gopi can comment more on this. What part do you think is missing here?

Nitin
Andrew Gallatin
2009-May-14 16:51 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nitin Hande wrote:
> Yes. This part was done for nxge, and as far as I remember the recent
> performance of this scheme was very close to that of the previous
> scheme. I think Gopi can comment more on this. What part do you think
> is missing here?

Perhaps I'm missing something... Doesn't nxge support multiple TX rings? If so, does the existing serialization serialize all traffic to a single ring, or is mac_tx_serializer_mode() applied after mac_tx_fanout_mode()?

I had thought the original nxge serializer serialized each TX ring separately in nxge. The fork I made of it for myri10ge certainly works that way.

Drew
rajagopal kunhappan
2009-May-14 17:20 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> Perhaps I'm missing something... Doesn't nxge support multiple TX
> rings?

Yes.

> If so, does the existing serialization serialize all traffic to a
> single ring, or is mac_tx_serializer_mode() applied after
> mac_tx_fanout_mode()?

Neither. Cutting and pasting from my previous reply:

... mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode; it exposes multiple Tx rings. When multiple Tx rings are present, mac_tx_fanout_mode() is used. mac_tx_fanout_mode() can also operate in serialized mode, in which case there is a serializer (soft ring) for each Tx ring. Nxge uses that mode. ...

-krgopi
On May 14, 2009, at 10:51 AM, Andrew Gallatin wrote:
> Perhaps I'm missing something... Doesn't nxge support multiple TX
> rings? If so, does the existing serialization serialize all traffic
> to a single ring, or is mac_tx_serializer_mode() applied after
> mac_tx_fanout_mode()?
>
> I had thought the original nxge serializer serialized each TX ring
> separately in nxge. The fork I made of it for myri10ge certainly
> works that way.

The serializer is only for use by the nxge driver, which has an inefficient TX path locking implementation. We didn't have the resources to completely rewrite the nxge transmit path as part of the Crossbow project, so we moved the serialization implementation into MAC for that driver. The serializer in MAC does serialization on a per-ring basis. The serializer should not be used by any other driver.

You don't have to use the serializer to support multiple TX rings. Keep your TX path lean and mean and apply good design principles, e.g. avoid holding locks for too long on your data path, and you should be fine.

Nicolas.

-- 
Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux
Andrew Gallatin
2009-May-14 18:34 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nicolas Droux wrote:
> The serializer is only for use by the nxge driver, which has an
> inefficient TX path locking implementation. We didn't have the
> resources to completely rewrite the nxge transmit path as part of the
> Crossbow project, so we moved the serialization implementation into
> MAC for that driver. The serializer in MAC does serialization on a
> per-ring basis. The serializer should not be used by any other
> driver.

krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode." So I'm confused. From the source, it looks like nxge is using that mode (MAC_VIRT_SERIALIZE |'ed into mi_v12n_level). So I guess it is restricted to using only one of its hw tx rings, then?

> You don't have to use the serializer to support multiple TX rings.
> Keep your TX path lean and mean and apply good design principles,
> e.g. avoid holding locks for too long on your data path, and you
> should be fine.

FWIW, my tx path is very "lean and mean". The only time locks are held is when writing the tx descriptors to the NIC, and when allocating a dma handle from a pre-allocated per-ring pool.

I thought the serializer was silly too, but PAE claimed a speedup from it. I think PAE claimed the speedup came from never back-pressuring the stack when the host overran the NIC. One of the "features" of the serializer was to always block the calling thread if the tx queue was exhausted.

Have you done any packets-per-second benchmarks with your fanout code? I'm concerned that it's very cache-unfriendly if you have a long run of packets all going to the same destination. This is because you walk the mblk chain, reading the packet headers, and queue up a big chain. If the chain gets too long, the mblk and/or the packet headers will be pushed out of cache by the time they make it to the driver's xmit routine. So in this case you could have twice as many cache misses as normal when things get really backed up.

Last, you (or somebody) mentioned there was interest in adding a hook for a driver to do fanout. Is there a bugid or something for this?

Drew
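As an illustration of the "lean and mean" locking Andrew and Nicolas describe, here is a sketch of a TX path that holds the ring lock only across the descriptor write; all names are hypothetical and the descriptor-formatting step is stubbed out.

#include <stdint.h>
#include <pthread.h>

typedef struct tx_ring {
	pthread_mutex_t	tr_lock;	/* protects tail/avail and the HW write */
	uint32_t	tr_tail;	/* next free descriptor slot */
	uint32_t	tr_avail;	/* free descriptor slots remaining */
} tx_ring_t;

static void
write_descriptors(tx_ring_t *tr, uint32_t tail, uint32_t ndesc)
{
	/* A real driver would format descriptors and ring the doorbell here. */
	(void)tr; (void)tail; (void)ndesc;
}

static int
ring_tx(tx_ring_t *tr, uint32_t ndesc)
{
	/* Header parsing, DMA mapping, etc. all happen before the lock. */
	pthread_mutex_lock(&tr->tr_lock);
	if (tr->tr_avail < ndesc) {
		pthread_mutex_unlock(&tr->tr_lock);
		return (-1);	/* out of descriptors: caller queues or drops */
	}
	write_descriptors(tr, tr->tr_tail, ndesc);	/* only work done under the lock */
	tr->tr_tail += ndesc;
	tr->tr_avail -= ndesc;
	pthread_mutex_unlock(&tr->tr_lock);
	return (0);
}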
rajagopal kunhappan
2009-May-14 18:54 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used
> when you have a single Tx ring. nxge would not use that mode." So I'm
> confused. From the source, it looks like nxge is using that mode
> (MAC_VIRT_SERIALIZE |'ed into mi_v12n_level).

Hi Andrew,

I think the confusion is in the name. mac_tx_serializer_mode() is used when you have a single ring; nxge exposes multiple rings. When multiple Tx rings are present, mac_tx_fanout_mode() gets called. In this mode, each Tx ring has a soft ring associated with it. The soft rings themselves are stored in the Tx SRS. Packets coming into mac_tx_fanout_mode() get fanned out to one of the Tx soft rings, and mac_tx_soft_ring_process() gets called. mac_tx_soft_ring_process() can either queue up the packets or send them directly to the NIC. In the case of nxge, packets get queued up in the soft ring and the soft ring worker thread sends them to the NIC.

Hope this clarifies.

> Have you done any packets-per-second benchmarks with your fanout
> code? I'm concerned that it's very cache-unfriendly

With nxge we get line rate with MTU-sized packets with 8 Tx rings. The numbers are similar to what they were with the nxge serializer in place.

> if you have a long run of packets all going to the same destination.
> This is because you walk the mblk chain, reading the packet headers,
> and queue up a big chain. If the chain gets too long, the mblk and/or
> the packet headers will be pushed out of cache by the time they make
> it to the driver's xmit routine. So in this case you could have twice
> as many cache misses as normal when things get really backed up.

We would like to have the drivers operate in non-serialized mode. But if for whatever reason you want to use serialized mode and there are issues, we can look into that.

Thanks,
-krgopi
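A data-layout sketch of the fanout mode Gopi describes: the TX SRS holds one soft ring per hardware TX ring, each with its own queue and worker thread, so a blocked ring does not stall the others. All names below are invented for illustration; the real structures live in the MAC layer (mac_soft_ring_set_t and friends).

#include <pthread.h>

struct mblk_chain {			/* stand-in for a queued mblk chain */
	void	*mc_head, *mc_tail;
};

struct soft_ring {
	pthread_mutex_t		sr_lock;
	struct mblk_chain	sr_queue;	/* packets backed up for this ring */
	pthread_t		sr_worker;	/* per-ring drain thread */
	void			*sr_hw_ring;	/* underlying driver TX ring */
};

struct tx_srs {
	unsigned int		srs_nrings;
	struct soft_ring	*srs_soft_rings;	/* one per advertised TX ring */
};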
Andrew Gallatin
2009-May-14 19:06 UTC
[crossbow-discuss] transmit side hashing for scaling?
rajagopal kunhappan wrote:
> We would like to have the drivers operate in non-serialized mode. But
> if for whatever reason you want to use serialized mode and there are
> issues, we can look into that.

The only reason I care about the serializer is the pre-Crossbow feedback from PAE that the original serializer avoided putting backpressure on the stack when the TX rings fill up. I'm happy using the normal fanout (with some caveats below) as long as PAE doesn't complain about it later.

The caveats being that I want a fanout mode that uses a standard Toeplitz hash so as to maintain CPU locality, or a hook so I can implement my own tx-side hashing.

Drew
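For reference, the Toeplitz hash Andrew mentions is small enough to sketch here. This is a generic bit-serial implementation of the algorithm Windows RSS specifies (XOR in a sliding 32-bit window of the secret key for every set bit of the input); it is not Crossbow code, and a real driver would feed it the connection 4-tuple and the NIC's RSS key.

#include <stddef.h>
#include <stdint.h>

static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *input, size_t inlen)
{
	uint32_t hash = 0;
	/* Sliding 32-bit window over the key, advanced one bit per input bit. */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	size_t keybit = 32;		/* next key bit to shift into the window */

	for (size_t i = 0; i < inlen; i++) {
		for (int b = 7; b >= 0; b--) {
			if (input[i] & (1u << b))
				hash ^= window;
			window <<= 1;
			if (keybit < keylen * 8)
				window |= (uint32_t)((key[keybit / 8] >>
				    (7 - (keybit % 8))) & 1);
			keybit++;
		}
	}
	return (hash);
}

On the receive side the NIC computes this over the packet's addresses and ports; the point of Andrew's request is that if the host's TX fanout used the same function and key, both sides would pick the same slice for a given connection.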
rajagopal kunhappan
2009-May-14 21:38 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> The only reason I care about the serializer is the pre-Crossbow
> feedback from PAE that the original serializer avoided putting
> backpressure on the stack when the TX rings fill up.

Yes, pre-Crossbow, putting back pressure would mean queueing up packets in DLD. All packets get queued in DLD until the driver relieves the flow control; by then thousands of packets would be sitting in DLD. (This is because TCP does not check for the STREAMS QFULL condition on the DLD write queue and keeps on sending packets.) After flow control is relieved, the queued-up packets are drained by dld_wsrv(), in single-threaded mode. A single thread is no good on a 10gig link, and this caused performance issues.

> I'm happy using the normal fanout (with some caveats below) as long
> as PAE doesn't complain about it later.
>
> The caveats being that I want a fanout mode that uses a standard
> Toeplitz hash so as to maintain CPU locality, or a hook so I can
> implement my own tx-side hashing.

I am curious as to how you maintain CPU locality for Tx traffic. Can you give some details?

On the Solaris stack, if you have a bunch of, say, TCP connections sending traffic, they can come from any CPU on the system. By this I mean that what CPU an application runs on is completely random unless you do CPU binding.

I can see tying Rx traffic to a specific Rx ring and CPU. If it is a forwarding case, then one can tie an Rx ring to a Tx ring.

Thanks,
-krgopi
Andrew Gallatin
2009-May-15 13:32 UTC
[crossbow-discuss] transmit side hashing for scaling?
rajagopal kunhappan wrote:
> Yes, pre-Crossbow, putting back pressure would mean queueing up
> packets in DLD. All packets get queued in DLD until the driver
> relieves the flow control; by then thousands of packets would be
> sitting in DLD. (This is because TCP does not check for the STREAMS
> QFULL condition on the DLD write queue and keeps on sending packets.)
> After flow control is relieved, the queued-up packets are drained by
> dld_wsrv(), in single-threaded mode. A single thread is no good on a
> 10gig link, and this caused performance issues.

And Crossbow addresses this?

> I am curious as to how you maintain CPU locality for Tx traffic. Can
> you give some details?
>
> On the Solaris stack, if you have a bunch of, say, TCP connections
> sending traffic, they can come from any CPU on the system. By this I
> mean that what CPU an application runs on is completely random unless
> you do CPU binding.
>
> I can see tying Rx traffic to a specific Rx ring and CPU. If it is a
> forwarding case, then one can tie an Rx ring to a Tx ring.

On OSes other than Windows, this helps mainly on the TCP receive side, in that acks flow out on the same CPU that handled the receive (assuming a direct dispatch from the ISR through to the TCP/IP stack).

AFAIK, only Windows can really control affinity to a fine level, since they require a Toeplitz hash, and you must provide hooks to use their "key" and to update your indirection table. This means they can control affinities for connections (or at least sets of connections) and update them on the fly to match the application's affinity. According to our Windows guy, they really use this stuff.

But all of this depends on the OS and the NIC agreeing on the hash. Is there any reason (patents? complexity? perception that the Windows solution is inferior?) that Crossbow does not try to take the Windows approach? Essentially all NICs available today that support multiple RX queues also support all of this other stuff that Windows requires. Why not take advantage of it?

Drew
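A minimal sketch of the Windows-style dispatch Andrew describes: the low bits of the Toeplitz hash index an indirection table that the OS rewrites at runtime to move connections between CPUs. The table size and names here are illustrative, not taken from the NDIS RSS interface.

#include <stdint.h>

#define RSS_TABLE_SIZE	128	/* a common indirection-table size; illustrative */

struct rss_state {
	uint8_t	indirection_table[RSS_TABLE_SIZE];	/* hash bucket -> CPU/queue */
};

static unsigned int
rss_pick_cpu(const struct rss_state *rss, uint32_t hash)
{
	/* The least-significant bits of the hash select the bucket. */
	return (rss->indirection_table[hash & (RSS_TABLE_SIZE - 1)]);
}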
Andrew Gallatin wrote:
> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the Windows
> solution is inferior?) that Crossbow does not try to take the Windows
> approach? Essentially all NICs available today that support multiple
> RX queues also support all of this other stuff that Windows requires.
> Why not take advantage of it?

Quite. I asked this question well over a year ago and never got an answer. Windows receive-side scaling works well and, because the hash is cached in the stack's connection data structure and passed down on the TX side, no s/w calculation of the hash is required.

Paul
Andrew Gallatin
2009-May-15 13:55 UTC
[crossbow-discuss] transmit side hashing for scaling?
Paul Durrant wrote:
> Quite. I asked this question well over a year ago and never got an
> answer. Windows receive-side scaling works well and, because the hash
> is cached in the stack's connection data structure and passed down on
> the TX side, no s/w calculation of the hash is required.

And if your goal is just to avoid tx hashing, FreeBSD does that now. It has no fine-grained affinity control, but it does cache the hash in the stack, so no expensive tx hashing is required. The changes in FreeBSD are trivial, compared to TX hashing. See, for example:

http://svn.freebsd.org/viewvc/base?view=revision&revision=190880

After this, packets come from the stack with m->m_pkthdr.flowid set, and all you need to do is mask based on the flowid to pick a tx queue.

Drew
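For comparison, a hedged sketch of the driver side of the FreeBSD change Andrew cites: the transmit handler picks a queue from the flowid the stack cached on the mbuf. The softc layout is hypothetical; m_pkthdr.flowid and the M_FLOWID flag are, if memory serves, the FreeBSD 8-era mbuf fields that change introduced.

#include <sys/param.h>
#include <sys/mbuf.h>

struct my_softc {
	int	sc_ntxq;	/* number of TX queues, assumed a power of two */
};

static int
my_select_txq(struct my_softc *sc, struct mbuf *m)
{
	/* Use the stack-cached flowid when present; no header parsing needed. */
	if (m->m_flags & M_FLOWID)
		return (m->m_pkthdr.flowid & (sc->sc_ntxq - 1));
	return (0);		/* no flowid: fall back to a default queue */
}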
On May 15, 2009, at 7:55 AM, Andrew Gallatin wrote:
> And if your goal is just to avoid tx hashing, FreeBSD does that now.
> It has no fine-grained affinity control, but it does cache the hash
> in the stack, so no expensive tx hashing is required.

We currently do a hash on the connection structure address, which is not as expensive as parsing and hashing the headers on a per-packet basis. But still, we are already working on changes that will allow the TX ring to be selected by the MAC layer through a handle, which will avoid that remaining hash operation.

Nicolas.

-- 
Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux
rajagopal kunhappan
2009-May-15 18:59 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> And Crossbow addresses this?

Yes. Each Tx ring has its own queue (soft ring). If a Tx ring is flow controlled, the packets get backed up in the soft ring associated with that Tx ring. Thus the other Tx rings can continue to send out traffic.

> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the Windows
> solution is inferior?) that Crossbow does not try to take the Windows
> approach? Essentially all NICs available today that support multiple
> RX queues also support all of this other stuff that Windows requires.
> Why not take advantage of it?

We take advantage of this, though not through a Toeplitz hash. There are some things that are missing, like being able to retarget an MSI-X interrupt to a CPU of our choice. Work is underway to have APIs to do this. Once we have this, we can have the poll thread run on the same CPU as the MSI-X interrupt that is associated with an Rx ring. We can further align other threads that take part in processing the incoming Rx traffic to use CPUs that belong to the same socket (same socket meaning CPUs sharing a common L2 cache).

-krgopi