Andrew Gallatin
2009-May-13 13:12 UTC
[crossbow-discuss] transmit side hashing for scaling?
Hi,

Can somebody shed some light on how Crossbow hashes outgoing packets to different transmit rings (not ring groups)?

My 10GbE driver has multiple rings (and a single group). Each transmit ring shares an interrupt with a corresponding receive ring. We call the set of one TX ring, one RX ring, and the interrupt handler state a "slice". Transmit completions are handled from the interrupt handler. On OSes that support multiple transmit routes, we've found that ensuring a particular connection is always hashed to the same slice by both the host and the NIC helps quite a bit with performance (it improves CPU locality, reduces cache misses, and decreases power consumption).

Some OSes (like FreeBSD) let a driver assist in tagging a connection so that traffic for the same connection is easily hashed to the same slice in the host and the NIC. Others (Linux, S10) let the driver hash the outgoing packets to provide this locality.

So: where is the transmit hashing done in Crossbow? Is it tunable? Is there a hook where I can provide a hash routine (like Linux)? Can I tag packets (like FreeBSD)? Is it at least something standard like Toeplitz?

Drew
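For readers following along, a minimal sketch of the per-slice state described above; every type and field name here is hypothetical rather than taken from myri10ge or any Solaris interface.

#include <stdint.h>

/*
 * Hypothetical per-slice state: one TX ring, one RX ring, and the
 * interrupt-handler state that services both.  Purely illustrative.
 */
struct tx_ring;			/* driver-private transmit ring */
struct rx_ring;			/* driver-private receive ring */

typedef struct slice {
	struct tx_ring	*sl_txr;	/* transmit descriptor ring */
	struct rx_ring	*sl_rxr;	/* receive descriptor ring */
	int		sl_intr_vec;	/* MSI-X vector shared by both rings */
	int		sl_cpu;		/* CPU the interrupt is bound to */
	uint64_t	sl_tx_done;	/* TX completions reaped in the handler */
} slice_t;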
Andrew Gallatin wrote:
> So: where is the transmit hashing done in Crossbow? Is it tunable?
> Is there a hook where I can provide a hash routine (like Linux)?
> Can I tag packets (like FreeBSD)? Is it at least something standard
> like Toeplitz?

If your driver has advertised multiple tx rings, then look at mac_tx_fanout_mode(), which computes the hash on the fanout hint passed down from ip. Providing hooks for additional hash routines has been suggested.

Nitin
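A conceptual sketch of the fanout Nitin describes: a per-connection hint from IP is hashed to a stable TX ring index. This is not the actual mac_tx_fanout_mode() source; the helper name and the mixing constants are invented for illustration.

#include <stdint.h>

static unsigned int
pick_tx_ring(uintptr_t fanout_hint, unsigned int nrings)
{
	/* Mix the hint so nearby hint values still spread across rings. */
	uint64_t h = (uint64_t)fanout_hint;

	h ^= h >> 16;
	h *= 0x9e3779b97f4a7c15ULL;	/* arbitrary odd multiplier */
	h ^= h >> 32;
	return ((unsigned int)(h % nrings));
}

The important property is that the same connection always yields the same hint, and therefore the same ring, which is the locality Andrew is asking about.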
Andrew Gallatin
2009-May-13 17:27 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nitin Hande wrote:
> If your driver has advertised multiple tx rings, then look at
> mac_tx_fanout_mode(), which computes the hash on the fanout hint
> passed down from ip. Providing hooks for additional hash routines has
> been suggested.

I guess my best bet might be to lie and say I have only one TX ring, then fan things out myself, like I used to before Crossbow. Is there any non-obvious disadvantage to that?

While looking at this, I noticed mac_tx_serializer_mode(). Am I reading it right, in that it serializes a single queue? That seems lacking, compared to the nxge_serialize stuff it replaces.

Drew
rajagopal kunhappan
2009-May-13 18:07 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> I guess my best bet might be to lie and say I have only one TX ring,
> then fan things out myself, like I used to before Crossbow. Is there
> any non-obvious disadvantage to that?

I don't know if this is non-obvious, but exposing Tx rings to the mac layer helps with:

1) Virtualization, e.g. VNICs getting their own Tx rings.
2) Flow control when multiple Tx rings are present (presently done for UDP). If a Tx ring is blocked (out of descriptors), then only the conn_t's sending on that Tx ring get blocked.

We want all the NICs to expose their Tx rings to the MAC layer.

> While looking at this, I noticed mac_tx_serializer_mode(). Am I
> reading it right, in that it serializes a single queue? That seems
> lacking, compared to the nxge_serialize stuff it replaces.

It is a generic solution for NICs that do not have good locking on the Tx side. mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode; it exposes multiple Tx rings. When multiple Tx rings are present, mac_tx_fanout_mode() is used. mac_tx_fanout_mode() can also operate in serialized mode, in which case there is a serializer (soft ring) for each Tx ring. Nxge uses that mode.

-krgopi
Andrew Gallatin wrote:
> I guess my best bet might be to lie and say I have only one TX ring,
> then fan things out myself, like I used to before Crossbow. Is there
> any non-obvious disadvantage to that?

If you advertise a single ring, then the tx path will end up in mac_tx_single_ring_mode(), the way it does for the e1000g driver. I think in that case the entry point into the driver is through the older xxx_m_tx() routine; you may have to pay attention to that in your driver.

There could be a slight variance between the two schemes. In the case of single_ring_mode(), if you get backpressured from the driver on the tx side due to lack of descriptors, then packets will be enqueued at the tx SRS. At that point, if there are multiple threads trying to send additional packets, all of the packets end up getting queued, whereas there is only one worker thread trying to clear the queue build-up. At high packet rates it is difficult for this one thread to catch up. (Also look at the MAC_DROP_ON_NO_DESC flag in mac_tx_srs_no_desc(), which drops the packets rather than queuing them.) In mac_tx_fanout_mode(), by contrast, each tx ring gets its own soft ring, and its own worker thread, in case of backpressure.

> While looking at this, I noticed mac_tx_serializer_mode(). Am I
> reading it right, in that it serializes a single queue? That seems
> lacking, compared to the nxge_serialize stuff it replaces.

Yes. This part was done for nxge, and as far as I remember the recent performance of this scheme was very close to that of the previous scheme. I think Gopi can comment more on this. What part do you think is missing here?

Nitin
Andrew Gallatin
2009-May-14 16:51 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nitin Hande wrote:
> Yes. This part was done for nxge, and as far as I remember the recent
> performance of this scheme was very close to that of the previous
> scheme. I think Gopi can comment more on this. What part do you think
> is missing here?

Perhaps I'm missing something... Doesn't nxge support multiple TX rings? If so, does the existing serialization serialize all traffic to a single ring, or is mac_tx_serializer_mode() applied after mac_tx_fanout_mode()?

I had thought the original nxge serializer serialized each TX ring separately in nxge. The fork I made of it for myri10ge certainly works that way.

Drew
rajagopal kunhappan
2009-May-14 17:20 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> Perhaps I'm missing something... Doesn't nxge support multiple TX
> rings?

Yes.

> If so, does the existing serialization serialize all traffic to a
> single ring, or is mac_tx_serializer_mode() applied after
> mac_tx_fanout_mode()?

Neither. Cutting and pasting from my previous reply:

... mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode; it exposes multiple Tx rings. When multiple Tx rings are present, mac_tx_fanout_mode() is used. mac_tx_fanout_mode() can also operate in serialized mode, in which case there is a serializer (soft ring) for each Tx ring. Nxge uses that mode. ...

-krgopi
On May 14, 2009, at 10:51 AM, Andrew Gallatin wrote:
> Perhaps I'm missing something... Doesn't nxge support multiple TX
> rings? If so, does the existing serialization serialize all traffic
> to a single ring, or is mac_tx_serializer_mode() applied after
> mac_tx_fanout_mode()?
>
> I had thought the original nxge serializer serialized each TX ring
> separately in nxge. The fork I made of it for myri10ge certainly
> works that way.

The serializer is only for use by the nxge driver, which has an inefficient TX path locking implementation. We didn't have the resources to completely rewrite the nxge transmit path as part of the Crossbow project, so we moved the serialization implementation into MAC for that driver. The serializer in MAC does serialization on a per-ring basis. The serializer should not be used by any other driver.

You don't have to use the serializer to support multiple TX rings. Keep your TX path lean and mean and apply good design principles, e.g. avoid holding locks for too long on your data path, and you should be fine.

Nicolas.

-- 
Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux
Andrew Gallatin
2009-May-14 18:34 UTC
[crossbow-discuss] transmit side hashing for scaling?
Nicolas Droux wrote:
> The serializer is only for use by the nxge driver, which has an
> inefficient TX path locking implementation. We didn't have the
> resources to completely rewrite the nxge transmit path as part of the
> Crossbow project, so we moved the serialization implementation into
> MAC for that driver. The serializer in MAC does serialization on a
> per-ring basis. The serializer should not be used by any other
> driver.

krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used when you have a single Tx ring. nxge would not use that mode." So I'm confused. From the source, it looks like nxge is using that mode (MAC_VIRT_SERIALIZE |'ed into mi_v12n_level). So I guess it is restricted to using only one of its hw tx rings, then?

> You don't have to use the serializer to support multiple TX rings.
> Keep your TX path lean and mean and apply good design principles,
> e.g. avoid holding locks for too long on your data path, and you
> should be fine.

FWIW, my tx path is very "lean and mean". The only time locks are held is when writing the tx descriptors to the NIC, and when allocating a dma handle from a pre-allocated per-ring pool.

I thought the serializer was silly too, but PAE claimed a speedup from it. I think PAE claimed the speedup came from never back-pressuring the stack when the host overran the NIC. One of the "features" of the serializer was to always block the calling thread if the tx queue was exhausted.

Have you done any packets-per-second benchmarks with your fanout code? I'm concerned that it's very cache-unfriendly if you have a long run of packets all going to the same destination. This is because you walk the mblk chain, reading the packet headers, and queue up a big chain. If the chain gets too long, the mblk and/or the packet headers will be pushed out of cache by the time they make it to the driver's xmit routine. So in this case you could have twice as many cache misses as normal when things get really backed up.

Last, you (or somebody) mentioned there was interest in adding a hook for a driver to do fanout. Is there a bugid or something for this?

Drew
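As an illustration of the "lean and mean" locking Andrew and Nicolas describe, here is a sketch of a TX path that holds the ring lock only across the descriptor write; all names are hypothetical and the descriptor-formatting step is stubbed out.

#include <stdint.h>
#include <pthread.h>

typedef struct tx_ring {
	pthread_mutex_t	tr_lock;	/* protects tail/avail and the HW write */
	uint32_t	tr_tail;	/* next free descriptor slot */
	uint32_t	tr_avail;	/* free descriptor slots remaining */
} tx_ring_t;

static void
write_descriptors(tx_ring_t *tr, uint32_t tail, uint32_t ndesc)
{
	/* A real driver would format descriptors and ring the doorbell here. */
	(void)tr; (void)tail; (void)ndesc;
}

static int
ring_tx(tx_ring_t *tr, uint32_t ndesc)
{
	/* Header parsing, DMA mapping, etc. all happen before the lock. */
	pthread_mutex_lock(&tr->tr_lock);
	if (tr->tr_avail < ndesc) {
		pthread_mutex_unlock(&tr->tr_lock);
		return (-1);	/* out of descriptors: caller queues or drops */
	}
	write_descriptors(tr, tr->tr_tail, ndesc);	/* only work done under the lock */
	tr->tr_tail += ndesc;
	tr->tr_avail -= ndesc;
	pthread_mutex_unlock(&tr->tr_lock);
	return (0);
}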
rajagopal kunhappan
2009-May-14 18:54 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used
> when you have a single Tx ring. nxge would not use that mode." So I'm
> confused. From the source, it looks like nxge is using that mode
> (MAC_VIRT_SERIALIZE |'ed into mi_v12n_level).

Hi Andrew,

I think the confusion is in the name. mac_tx_serializer_mode() is used when you have a single ring; nxge exposes multiple rings. When multiple Tx rings are present, mac_tx_fanout_mode() gets called. In this mode, each Tx ring has a soft ring associated with it. The soft rings themselves are stored in the Tx SRS. Packets coming into mac_tx_fanout_mode() get fanned out to one of the Tx soft rings, and mac_tx_soft_ring_process() gets called. mac_tx_soft_ring_process() can either queue up the packets or send them directly to the NIC. In the case of nxge, packets get queued up in the soft ring and the soft ring worker thread sends them to the NIC.

Hope this clarifies.

> Have you done any packets-per-second benchmarks with your fanout
> code? I'm concerned that it's very cache-unfriendly

With nxge we get line rate with MTU-sized packets with 8 Tx rings. The numbers are similar to what they were with the nxge serializer in place.

> if you have a long run of packets all going to the same destination.
> This is because you walk the mblk chain, reading the packet headers,
> and queue up a big chain. If the chain gets too long, the mblk and/or
> the packet headers will be pushed out of cache by the time they make
> it to the driver's xmit routine. So in this case you could have twice
> as many cache misses as normal when things get really backed up.

We would like to have the drivers operate in non-serialized mode. But if for whatever reason you want to use serialized mode and there are issues, we can look into that.

Thanks,
-krgopi
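A data-layout sketch of the fanout mode Gopi describes: the TX SRS holds one soft ring per hardware TX ring, each with its own queue and worker thread, so a blocked ring does not stall the others. All names below are invented for illustration; the real structures live in the MAC layer (mac_soft_ring_set_t and friends).

#include <pthread.h>

struct mblk_chain {			/* stand-in for a queued mblk chain */
	void	*mc_head, *mc_tail;
};

struct soft_ring {
	pthread_mutex_t		sr_lock;
	struct mblk_chain	sr_queue;	/* packets backed up for this ring */
	pthread_t		sr_worker;	/* per-ring drain thread */
	void			*sr_hw_ring;	/* underlying driver TX ring */
};

struct tx_srs {
	unsigned int		srs_nrings;
	struct soft_ring	*srs_soft_rings;	/* one per advertised TX ring */
};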
Andrew Gallatin
2009-May-14 19:06 UTC
[crossbow-discuss] transmit side hashing for scaling?
rajagopal kunhappan wrote:
> We would like to have the drivers operate in non-serialized mode. But
> if for whatever reason you want to use serialized mode and there are
> issues, we can look into that.

The only reason I care about the serializer is the pre-Crossbow feedback from PAE that the original serializer avoided putting backpressure on the stack when the TX rings fill up. I'm happy using the normal fanout (with some caveats below) as long as PAE doesn't complain about it later.

The caveats being that I want a fanout mode that uses a standard Toeplitz hash so as to maintain CPU locality, or a hook so I can implement my own tx-side hashing.

Drew
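For reference, the Toeplitz hash Andrew mentions is small enough to sketch here. This is a generic bit-serial implementation of the algorithm Windows RSS specifies (XOR in a sliding 32-bit window of the secret key for every set bit of the input); it is not Crossbow code, and a real driver would feed it the connection 4-tuple and the NIC's RSS key.

#include <stddef.h>
#include <stdint.h>

static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen,
    const uint8_t *input, size_t inlen)
{
	uint32_t hash = 0;
	/* Sliding 32-bit window over the key, advanced one bit per input bit. */
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	size_t keybit = 32;		/* next key bit to shift into the window */

	for (size_t i = 0; i < inlen; i++) {
		for (int b = 7; b >= 0; b--) {
			if (input[i] & (1u << b))
				hash ^= window;
			window <<= 1;
			if (keybit < keylen * 8)
				window |= (uint32_t)((key[keybit / 8] >>
				    (7 - (keybit % 8))) & 1);
			keybit++;
		}
	}
	return (hash);
}

On the receive side the NIC computes this over the packet's addresses and ports; the point of Andrew's request is that if the host's TX fanout used the same function and key, both sides would pick the same slice for a given connection.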
rajagopal kunhappan
2009-May-14 21:38 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> The only reason I care about the serializer is the pre-Crossbow
> feedback from PAE that the original serializer avoided putting
> backpressure on the stack when the TX rings fill up.

Yes, pre-Crossbow, putting back pressure would mean queueing up packets in DLD. All packets get queued in DLD until the driver relieves the flow control; by then thousands of packets would be sitting in DLD. (This is because TCP does not check for the STREAMS QFULL condition on the DLD write queue and keeps on sending packets.) After flow control is relieved, the queued-up packets are drained by dld_wsrv(), in single-threaded mode. A single thread is no good on a 10gig link, and this caused performance issues.

> I'm happy using the normal fanout (with some caveats below) as long
> as PAE doesn't complain about it later.
>
> The caveats being that I want a fanout mode that uses a standard
> Toeplitz hash so as to maintain CPU locality, or a hook so I can
> implement my own tx-side hashing.

I am curious as to how you maintain CPU locality for Tx traffic. Can you give some details?

On the Solaris stack, if you have a bunch of, say, TCP connections sending traffic, they can come from any CPU on the system. By this I mean that what CPU an application runs on is completely random unless you do CPU binding.

I can see tying Rx traffic to a specific Rx ring and CPU. If it is a forwarding case, then one can tie an Rx ring to a Tx ring.

Thanks,
-krgopi
Andrew Gallatin
2009-May-15 13:32 UTC
[crossbow-discuss] transmit side hashing for scaling?
rajagopal kunhappan wrote:
> Yes, pre-Crossbow, putting back pressure would mean queueing up
> packets in DLD. All packets get queued in DLD until the driver
> relieves the flow control; by then thousands of packets would be
> sitting in DLD. (This is because TCP does not check for the STREAMS
> QFULL condition on the DLD write queue and keeps on sending packets.)
> After flow control is relieved, the queued-up packets are drained by
> dld_wsrv(), in single-threaded mode. A single thread is no good on a
> 10gig link, and this caused performance issues.

And Crossbow addresses this?

> I am curious as to how you maintain CPU locality for Tx traffic. Can
> you give some details?
>
> On the Solaris stack, if you have a bunch of, say, TCP connections
> sending traffic, they can come from any CPU on the system. By this I
> mean that what CPU an application runs on is completely random unless
> you do CPU binding.
>
> I can see tying Rx traffic to a specific Rx ring and CPU. If it is a
> forwarding case, then one can tie an Rx ring to a Tx ring.

On OSes other than Windows, this helps mainly on the TCP receive side, in that acks flow out on the same CPU that handled the receive (assuming a direct dispatch from the ISR through to the TCP/IP stack).

AFAIK, only Windows can really control affinity to a fine level, since they require a Toeplitz hash, and you must provide hooks to use their "key" and to update your indirection table. This means they can control affinities for connections (or at least sets of connections) and update them on the fly to match the application's affinity. According to our Windows guy, they really use this stuff.

But all of this depends on the OS and the NIC agreeing on the hash. Is there any reason (patents? complexity? perception that the Windows solution is inferior?) that Crossbow does not try to take the Windows approach? Essentially all NICs available today that support multiple RX queues also support all of this other stuff that Windows requires. Why not take advantage of it?

Drew
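A minimal sketch of the Windows-style dispatch Andrew describes: the low bits of the Toeplitz hash index an indirection table that the OS rewrites at runtime to move connections between CPUs. The table size and names here are illustrative, not taken from the NDIS RSS interface.

#include <stdint.h>

#define RSS_TABLE_SIZE	128	/* a common indirection-table size; illustrative */

struct rss_state {
	uint8_t	indirection_table[RSS_TABLE_SIZE];	/* hash bucket -> CPU/queue */
};

static unsigned int
rss_pick_cpu(const struct rss_state *rss, uint32_t hash)
{
	/* The least-significant bits of the hash select the bucket. */
	return (rss->indirection_table[hash & (RSS_TABLE_SIZE - 1)]);
}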
Andrew Gallatin wrote:
> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the Windows
> solution is inferior?) that Crossbow does not try to take the Windows
> approach? Essentially all NICs available today that support multiple
> RX queues also support all of this other stuff that Windows requires.
> Why not take advantage of it?

Quite. I asked this question well over a year ago and never got an answer. Windows receive-side scaling works well and, because the hash is cached in the stack's connection data structure and passed down on the TX side, no s/w calculation of the hash is required.

Paul
Andrew Gallatin
2009-May-15 13:55 UTC
[crossbow-discuss] transmit side hashing for scaling?
Paul Durrant wrote:
> Quite. I asked this question well over a year ago and never got an
> answer. Windows receive-side scaling works well and, because the hash
> is cached in the stack's connection data structure and passed down on
> the TX side, no s/w calculation of the hash is required.

And if your goal is just to avoid tx hashing, FreeBSD does that now. It has no fine-grained affinity control, but it does cache the hash in the stack, so no expensive tx hashing is required. The changes in FreeBSD are trivial, compared to TX hashing. See, for example:

http://svn.freebsd.org/viewvc/base?view=revision&revision=190880

After this, packets come from the stack with m->m_pkthdr.flowid set, and all you need to do is mask based on the flowid to pick a tx queue.

Drew
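For comparison, a hedged sketch of the driver side of the FreeBSD change Andrew cites: the transmit handler picks a queue from the flowid the stack cached on the mbuf. The softc layout is hypothetical; m_pkthdr.flowid and the M_FLOWID flag are, if memory serves, the FreeBSD 8-era mbuf fields that change introduced.

#include <sys/param.h>
#include <sys/mbuf.h>

struct my_softc {
	int	sc_ntxq;	/* number of TX queues, assumed a power of two */
};

static int
my_select_txq(struct my_softc *sc, struct mbuf *m)
{
	/* Use the stack-cached flowid when present; no header parsing needed. */
	if (m->m_flags & M_FLOWID)
		return (m->m_pkthdr.flowid & (sc->sc_ntxq - 1));
	return (0);		/* no flowid: fall back to a default queue */
}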
On May 15, 2009, at 7:55 AM, Andrew Gallatin wrote:
> And if your goal is just to avoid tx hashing, FreeBSD does that now.
> It has no fine-grained affinity control, but it does cache the hash
> in the stack, so no expensive tx hashing is required.

We currently do a hash on the connection structure address, which is not as expensive as parsing and hashing the headers on a per-packet basis. But still, we are already working on changes that will allow the TX ring to be selected by the MAC layer through a handle, which will avoid that remaining hash operation.

Nicolas.

-- 
Nicolas Droux - Solaris Kernel Networking - Sun Microsystems, Inc.
nicolas.droux at sun.com - http://blogs.sun.com/droux
rajagopal kunhappan
2009-May-15 18:59 UTC
[crossbow-discuss] transmit side hashing for scaling?
Andrew Gallatin wrote:
> And Crossbow addresses this?

Yes. Each Tx ring has its own queue (soft ring). If a Tx ring is flow controlled, the packets get backed up in the soft ring associated with that Tx ring. Thus the other Tx rings can continue to send out traffic.

> But all of this depends on the OS and the NIC agreeing on the hash.
> Is there any reason (patents? complexity? perception that the Windows
> solution is inferior?) that Crossbow does not try to take the Windows
> approach? Essentially all NICs available today that support multiple
> RX queues also support all of this other stuff that Windows requires.
> Why not take advantage of it?

We take advantage of this, though not through a Toeplitz hash. There are some things that are missing, like being able to retarget an MSI-X interrupt to a CPU of our choice. Work is underway to have APIs to do this. Once we have this, we can have the poll thread run on the same CPU as the MSI-X interrupt that is associated with an Rx ring. We can further align other threads that take part in processing the incoming Rx traffic to use CPUs that belong to the same socket (same socket meaning CPUs sharing a common L2 cache).

-krgopi