Wei Liu
2013-Sep-06 10:16 UTC
TSQ accounting skb->truesize degrades throughput for large packets
Hi Eric

I have some questions regarding TSQ and I hope you can shed some light
on this.

Our observation is that with the default TSQ limit (128K), throughput
of the Xen network driver degrades for large packets. That's because we
now only have one packet in the queue.

I double-checked that skb->len is indeed < 64K. Then I discovered that
TSQ actually accounts for skb->truesize, and the packets generated had
skb->truesize > 64K, which effectively prevented us from putting two
packets in the queue.

There seems to be no way to limit skb->truesize inside the driver -- the
skb is already constructed by the time it reaches xen-netfront.

My questions are:

1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
   accounts skb truesize, including skb overhead. But thats OK". I
   don't quite understand why it is OK.
2) Presumably other drivers will suffer from this as well; is it
   possible to account for skb->len instead of skb->truesize?
3) If accounting skb->truesize is on purpose, does that mean we only
   need to tune that value instead of trying to fix our driver (if
   there is a way to)?

Thanks
Wei.
Eric Dumazet
2013-Sep-06 12:57 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> Hi Eric
>
> I have some questions regarding TSQ and I hope you can shed some light
> on this.
>
> Our observation is that with the default TSQ limit (128K), throughput
> of the Xen network driver degrades for large packets. That's because we
> now only have one packet in the queue.
>
> I double-checked that skb->len is indeed < 64K. Then I discovered that
> TSQ actually accounts for skb->truesize, and the packets generated had
> skb->truesize > 64K, which effectively prevented us from putting two
> packets in the queue.
>
> There seems to be no way to limit skb->truesize inside the driver -- the
> skb is already constructed by the time it reaches xen-netfront.

What is the skb->truesize value then? It must be huge, and it's clearly
a problem, because the TCP _receiver_ will also grow its window more
slowly if the packet is looped back.

> My questions are:
> 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
>    accounts skb truesize, including skb overhead. But thats OK". I
>    don't quite understand why it is OK.
> 2) Presumably other drivers will suffer from this as well; is it
>    possible to account for skb->len instead of skb->truesize?

Well, I have no problem getting line rate on 20Gb with a single flow, so
other drivers have no problem.

> 3) If accounting skb->truesize is on purpose, does that mean we only
>    need to tune that value instead of trying to fix our driver (if
>    there is a way to)?

The check in TCP allows for at least two packets, unless a single skb
truesize is 128K?

	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}

So if a skb->truesize is 100K, this condition allows two packets before
throttling the third packet.

It's actually hard to account for skb->len, because sk_wmem_alloc
accounts for skb->truesize: I do not want to add another
sk->sk_wbytes_alloc atomic field.
Wei Liu
2013-Sep-06 13:12 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, Sep 06, 2013 at 05:57:48AM -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> > Hi Eric
> >
> > I have some questions regarding TSQ and I hope you can shed some light
> > on this.
> >
> > Our observation is that with the default TSQ limit (128K), throughput
> > of the Xen network driver degrades for large packets. That's because we
> > now only have one packet in the queue.
> >
> > I double-checked that skb->len is indeed < 64K. Then I discovered that
> > TSQ actually accounts for skb->truesize, and the packets generated had
> > skb->truesize > 64K, which effectively prevented us from putting two
> > packets in the queue.
> >
> > There seems to be no way to limit skb->truesize inside the driver -- the
> > skb is already constructed by the time it reaches xen-netfront.
>
> What is the skb->truesize value then? It must be huge, and it's clearly
> a problem, because the TCP _receiver_ will also grow its window more
> slowly if the packet is looped back.

It's ~66KB.

> > My questions are:
> > 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
> >    accounts skb truesize, including skb overhead. But thats OK". I
> >    don't quite understand why it is OK.
> > 2) Presumably other drivers will suffer from this as well; is it
> >    possible to account for skb->len instead of skb->truesize?
>
> Well, I have no problem getting line rate on 20Gb with a single flow, so
> other drivers have no problem.

OK, good to know this.

> > 3) If accounting skb->truesize is on purpose, does that mean we only
> >    need to tune that value instead of trying to fix our driver (if
> >    there is a way to)?
>
> The check in TCP allows for at least two packets, unless a single skb
> truesize is 128K?
>
> 	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> 		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 		break;
> 	}
>
> So if a skb->truesize is 100K, this condition allows two packets before
> throttling the third packet.

OK. I need to check why we're getting only 1 then. Thanks for your reply.

Wei.

> It's actually hard to account for skb->len, because sk_wmem_alloc
> accounts for skb->truesize: I do not want to add another
> sk->sk_wbytes_alloc atomic field.
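As a rough illustration of why the figure lands near 66KB -- this sketch is not from the thread, and the overhead constants in it are assumptions -- a 64K GSO skb whose payload sits in 4K page fragments is charged a whole page per fragment plus skb metadata, so truesize tracks allocated memory rather than payload bytes:

	/*
	 * Back-of-envelope estimate of skb->truesize for a 64K GSO skb whose
	 * payload lives in 4K page fragments (the xen-netfront case discussed
	 * above). The overhead constants are assumptions for illustration only.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned int page_size     = 4096;               /* PAGE_SIZE */
		unsigned int payload       = 64 * 1024;          /* skb->len upper bound */
		unsigned int nr_frags      = payload / page_size; /* 16 page frags */
		unsigned int skb_overhead  = 256;                /* assumed sizeof(struct sk_buff) */
		unsigned int head_overhead = 1024;               /* assumed linear area + shared info */

		unsigned int truesize = skb_overhead + head_overhead + nr_frags * page_size;

		/* ~66KB: larger than skb->len, which is what the TSQ check compares */
		printf("estimated truesize: %u bytes (len %u)\n", truesize, payload);
		return 0;
	}

The point is that the TSQ condition compares this inflated truesize, not skb->len, against tcp_limit_output_bytes.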
Zoltan Kiss
2013-Sep-06 16:36 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 06/09/13 13:57, Eric Dumazet wrote:
> Well, I have no problem getting line rate on 20Gb with a single flow, so
> other drivers have no problem.

I've made some tests on bare metal:

Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
ixgbe (TSO, GSO on), iperf 2.0.5

Transmitting packets toward the remote end (so running iperf -c on this
host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
(softint percentage in top also increased, from ~3 to ~5%).

So I guess it would be good to revisit the default value of this
setting. What hw did you use, Eric, for your 20Gb results?

Regards,

Zoli
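For anyone wanting to repeat this kind of experiment, the sysctl can be changed by writing the procfs file directly; a minimal C sketch (equivalent to `sysctl -w net.ipv4.tcp_limit_output_bytes=131506`, needs root, and the 131506 value simply mirrors the test above):

	/* Minimal sketch: bump the TSQ limit via procfs, as sysctl -w would. */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *path = "/proc/sys/net/ipv4/tcp_limit_output_bytes";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror("fopen");        /* needs root and a TSQ-capable kernel */
			return EXIT_FAILURE;
		}
		fprintf(f, "%d\n", 131506);     /* default is 131072 (128K) */
		fclose(f);
		return EXIT_SUCCESS;
	}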
Eric Dumazet
2013-Sep-06 16:56 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> On 06/09/13 13:57, Eric Dumazet wrote:
> > Well, I have no problem getting line rate on 20Gb with a single flow, so
> > other drivers have no problem.
>
> I've made some tests on bare metal:
>
> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
> ixgbe (TSO, GSO on), iperf 2.0.5
>
> Transmitting packets toward the remote end (so running iperf -c on this
> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
> When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
> to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
> (softint percentage in top also increased, from ~3 to ~5%).

Typical tradeoff between latency and throughput.

If you favor throughput, then you can increase tcp_limit_output_bytes.

The default is quite reasonable IMHO.

> So I guess it would be good to revisit the default value of this
> setting. What hw did you use, Eric, for your 20Gb results?

Mellanox CX-3

Make sure your NIC doesn't hold TX packets in the TX ring too long before
signaling an interrupt for TX completion.

For example I had to patch mellanox:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults

    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.

    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.

    I suggest using 16 us instead of 128 us, allowing a finer control.

    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.

    This patch is also a BQL prereq.

    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
    Cc: Or Gerlitz <ogerlitz@mellanox.com>
    Acked-by: Amir Vadai <amirv@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet
2013-Sep-06 17:00 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> So I guess it would be good to revisit the default value of this
> setting.

If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
can tweak dev->gso_max_size from 65535 to 64000.
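As a sketch of what that tweak would look like from a driver's side -- the function below is a made-up placeholder, not actual ixgbe or xen-netfront code -- drivers normally cap this through the netif_set_gso_max_size() helper:

	#include <linux/netdevice.h>

	/*
	 * Sketch only: cap the size of GSO skbs the stack will build for this
	 * device, so that a 128K TSQ budget fits more than two of them.
	 * my_driver_limit_gso()/my_netdev are placeholders, not real driver code.
	 */
	static void my_driver_limit_gso(struct net_device *my_netdev)
	{
		/* default is ~64K; 64000 leaves headroom for skb overhead */
		netif_set_gso_max_size(my_netdev, 64000);
	}

Smaller GSO skbs mean smaller per-skb truesize, so more of them fit under the same tcp_limit_output_bytes budget.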
Eric Dumazet
2013-Sep-07 17:21 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
> > So I guess it would be good to revisit the default value of this
> > setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.

Another idea would be to no longer use tcp_limit_output_bytes but

	max(sk_pacing_rate / 1000, 2*MSS)

This means that the number of packets in FQ would be limited to the
equivalent of 1 ms, so TCP could respond faster to packet losses:
retransmitted packets would not have to wait for prior packets to be
drained from FQ.

For an 8 Gbps flow (1 Gbyte/s), sk_pacing_rate would be 2 Gbyte/s, and
this would translate to ~2 Mbytes in the Qdisc/TX ring.

sk_pacing_rate was introduced in linux-3.12, but could be backported
easily.
Jason Wang
2013-Sep-09 09:27 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 09/07/2013 12:56 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>> On 06/09/13 13:57, Eric Dumazet wrote:
>>> Well, I have no problem getting line rate on 20Gb with a single flow, so
>>> other drivers have no problem.
>>
>> I've made some tests on bare metal:
>>
>> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
>> ixgbe (TSO, GSO on), iperf 2.0.5
>>
>> Transmitting packets toward the remote end (so running iperf -c on this
>> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
>> When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
>> to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
>> (softint percentage in top also increased, from ~3 to ~5%).
>
> Typical tradeoff between latency and throughput.
>
> If you favor throughput, then you can increase tcp_limit_output_bytes.
>
> The default is quite reasonable IMHO.
>
>> So I guess it would be good to revisit the default value of this
>> setting. What hw did you use, Eric, for your 20Gb results?
>
> Mellanox CX-3
>
> Make sure your NIC doesn't hold TX packets in the TX ring too long before
> signaling an interrupt for TX completion.

Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
packets in the device accurately, and it also can't do BQL. Does this
mean TSQ should be disabled for virtio-net?
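The pattern being described looks roughly like the sketch below (placeholder names, not the actual virtio-net code): skb_orphan() drops the skb's socket reference and the truesize charge on sk_wmem_alloc, so TSQ stops seeing those bytes as outstanding the moment the driver accepts the packet.

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/*
	 * Sketch of an .ndo_start_xmit that orphans the skb up front.
	 * skb_orphan() runs skb->destructor and clears skb->sk, which
	 * releases the truesize charged to sk_wmem_alloc -- so the TSQ
	 * check in tcp_write_xmit() no longer counts these bytes, even
	 * though they still sit in the device's TX ring.
	 */
	static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		skb_orphan(skb);	/* TSQ/byte accounting stops here */

		/* ... hand the buffer to the hypervisor / hardware ring ... */

		return NETDEV_TX_OK;
	}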
Eric Dumazet
2013-Sep-09 13:47 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
> Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
> packets in the device accurately, and it also can't do BQL. Does this
> mean TSQ should be disabled for virtio-net?

If skbs are orphaned, there is no way TSQ can work at all.

It is already disabled, so why do you want to disable it?
Zoltan Kiss
2013-Sep-09 21:41 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 07/09/13 18:21, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>>
>>> So I guess it would be good to revisit the default value of this
>>> setting.
>>
>> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
>> can tweak dev->gso_max_size from 65535 to 64000.
>
> Another idea would be to no longer use tcp_limit_output_bytes but
>
> max(sk_pacing_rate / 1000, 2*MSS)

I've tried this on a freshly updated upstream, and it solved my problem
on ixgbe:

-	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {

Now I can get proper line rate. Btw. I've tried to decrease
dev->gso_max_size to 60K or 32K; both were ineffective.

Regards,

Zoli
Eric Dumazet
2013-Sep-09 21:56 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Mon, 2013-09-09 at 22:41 +0100, Zoltan Kiss wrote:
> On 07/09/13 18:21, Eric Dumazet wrote:
> > On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> >> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> >>
> >>> So I guess it would be good to revisit the default value of this
> >>> setting.
> >>
> >> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> >> can tweak dev->gso_max_size from 65535 to 64000.
> >
> > Another idea would be to no longer use tcp_limit_output_bytes but
> >
> > max(sk_pacing_rate / 1000, 2*MSS)
>
> I've tried this on a freshly updated upstream, and it solved my problem
> on ixgbe:
>
> -	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> +	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {
>
> Now I can get proper line rate. Btw. I've tried to decrease
> dev->gso_max_size to 60K or 32K; both were ineffective.

Yeah, my own test was more like the following:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..07dc77a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1872,7 +1872,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* TSQ : sk_wmem_alloc accounts skb truesize,
 		 * including skb overhead. But thats OK.
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		if (atomic_read(&sk->sk_wmem_alloc) >= max(2 * mss_now,
+							   sk->sk_pacing_rate >> 8)) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}

Note that it also seems to make Hystart happier.

I will send patches when all tests are green.
Jason Wang
2013-Sep-10 07:45 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 09/09/2013 09:47 PM, Eric Dumazet wrote:
> On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
>
>> Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
>> packets in the device accurately, and it also can't do BQL. Does this
>> mean TSQ should be disabled for virtio-net?
>
> If skbs are orphaned, there is no way TSQ can work at all.

For example, virtio-net will stop the tx queue when it finds the tx
queue may be full, and enable the queue again when some packets have
been sent. In this case TSQ works and throttles the total bytes queued
in the qdisc. This usually happens during heavy network load, such as
two sessions of netperf.

> It is already disabled, so why do you want to disable it?

We noticed a regression, and bisect shows it was introduced by TSQ.
Eric Dumazet
2013-Sep-10 12:35 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Tue, 2013-09-10 at 15:45 +0800, Jason Wang wrote:
> For example, virtio-net will stop the tx queue when it finds the tx
> queue may be full, and enable the queue again when some packets have
> been sent. In this case TSQ works and throttles the total bytes queued
> in the qdisc. This usually happens during heavy network load, such as
> two sessions of netperf.

You told me skbs were _orphaned_. This automatically _disables_ TSQ
after packets leave the Qdisc.

So you have a problem because your skb orphaning only takes effect when
packets leave the Qdisc. If you can't afford sockets being throttled,
make sure you have no Qdisc!

> We noticed a regression, and bisect shows it was introduced by TSQ.

You do realize TSQ is a balance between throughput and latencies?

In the case of TSQ, it was very clear that limiting the amount of
outstanding bytes in queues could have an impact on bandwidth.

Pushing megabytes of TCP packets with identical TCP timestamps is bad,
because it prevents us from doing delay-based congestion control, and a
single flow could fill the Qdisc with a thousand packets. (Self-induced
delays; see the BufferBloat discussions.)

One known problem in the TCP stack is that sendmsg() locks the socket
for the duration of the call. sendpage() does not have this problem.
tcp_tsq_handler() is deferred if tcp_tasklet_func() finds a locked
socket. The owner of the socket will call tcp_tsq_handler() when the
socket is released.

So if you use sendmsg() with large buffers, or if copying in data from
user land involves page faults, it may explain why you need a larger
number of in-flight bytes to sustain a given throughput.

You could take a look at commit c9bee3b7fdecb0c1d070c ("tcp:
TCP_NOTSENT_LOWAT socket option"), and play with
/proc/sys/net/ipv4/tcp_notsent_lowat, to force sendmsg() to release the
socket lock every few hundred kbytes.
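For reference, TCP_NOTSENT_LOWAT can also be set per socket with setsockopt(); a minimal userspace sketch, where the 128KB threshold is just an example value:

	/*
	 * Sketch: cap the amount of not-yet-sent data sendmsg() can queue on
	 * one socket, so the socket lock is released more often. Requires a
	 * kernel with commit c9bee3b7fdecb0 ("tcp: TCP_NOTSENT_LOWAT socket
	 * option"), i.e. linux-3.12+.
	 */
	#include <stdio.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	#ifndef TCP_NOTSENT_LOWAT
	#define TCP_NOTSENT_LOWAT 25	/* from include/uapi/linux/tcp.h */
	#endif

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		int lowat = 128 * 1024;		/* example threshold: 128 KB unsent */

		if (fd < 0 || setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
					 &lowat, sizeof(lowat)) < 0) {
			perror("TCP_NOTSENT_LOWAT");
			return 1;
		}
		printf("not-sent low watermark set to %d bytes\n", lowat);
		return 0;
	}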
Cong Wang
2013-Sep-21 03:00 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>
> Yeah, my own test was more like the following
>
...
>
> Note that it also seems to make Hystart happier.
>
> I will send patches when all tests are green.

How is this going? I don't see any patch posted to netdev.

Thanks!
Wei Liu
2013-Sep-21 15:03 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
> >
> > Yeah, my own test was more like the following
> >
> ...
> >
> > Note that it also seems to make Hystart happier.
> >
> > I will send patches when all tests are green.
>
> How is this going? I don't see any patch posted to netdev.

I'm afraid you forgot to CC any relevant people in this email. :-)

Wei.

> Thanks!
Cong Wang
2013-Sep-22 02:36 UTC
Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
On Sat, Sep 21, 2013 at 11:03 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
>> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>> >
>> > Yeah, my own test was more like the following
>> >
>> ...
>> >
>> > Note that it also seems to make Hystart happier.
>> >
>> > I will send patches when all tests are green.
>>
>> How is this going? I don't see any patch posted to netdev.
>
> I'm afraid you forgot to CC any relevant people in this email. :-)

I was replying via newsgroup, not a mailing list. :)

Anyway, adding Eric and netdev now.
Eric Dumazet
2013-Sep-22 14:58 UTC
Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
On Sun, 2013-09-22 at 10:36 +0800, Cong Wang wrote:
>
> I was replying via newsgroup, not a mailing list. :)
>
> Anyway, adding Eric and netdev now.

Yes, don't worry, this will be done on Monday or Tuesday. I am still in
New Orleans after LPC 2013.