Wei Liu
2013-Sep-06 10:16 UTC
TSQ accounting skb->truesize degrades throughput for large packets
Hi Eric

I have some questions regarding TSQ and I hope you can shed some light
on this.

Our observation is that with the default TSQ limit (128K), throughput
of the Xen network driver degrades for large packets. That's because we
now only have one packet in the queue.

I double-checked that skb->len is indeed < 64K. Then I discovered that
TSQ actually accounts for skb->truesize, and the packets generated had
skb->truesize > 64K, which effectively prevented us from putting two
packets in the queue.

There seems to be no way to limit skb->truesize inside the driver -- the
skb is already constructed by the time it reaches xen-netfront.

My questions are:

1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
   accounts skb truesize, including skb overhead. But thats OK". I
   don't quite understand why it is OK.
2) Presumably other drivers will suffer from this as well; is it
   possible to account for skb->len instead of skb->truesize?
3) If accounting skb->truesize is on purpose, does that mean we only
   need to tune that value instead of trying to fix our driver (if
   there is a way to)?

Thanks
Wei.
Eric Dumazet
2013-Sep-06 12:57 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> Hi Eric
>
> I have some questions regarding TSQ and I hope you can shed some light
> on this.
>
> Our observation is that with the default TSQ limit (128K), throughput
> of the Xen network driver degrades for large packets. That's because we
> now only have one packet in the queue.
>
> I double-checked that skb->len is indeed < 64K. Then I discovered that
> TSQ actually accounts for skb->truesize, and the packets generated had
> skb->truesize > 64K, which effectively prevented us from putting two
> packets in the queue.
>
> There seems to be no way to limit skb->truesize inside the driver -- the
> skb is already constructed by the time it reaches xen-netfront.

What is the skb->truesize value then? It must be huge, and it's clearly
a problem, because the TCP _receiver_ will also grow its window more
slowly if the packet is looped back.

> My questions are:
> 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
>    accounts skb truesize, including skb overhead. But thats OK". I
>    don't quite understand why it is OK.
> 2) Presumably other drivers will suffer from this as well; is it
>    possible to account for skb->len instead of skb->truesize?

Well, I have no problem getting line rate on 20Gb with a single flow, so
other drivers have no problem.

> 3) If accounting skb->truesize is on purpose, does that mean we only
>    need to tune that value instead of trying to fix our driver (if
>    there is a way to)?

The check in TCP allows for at least two packets, unless a single skb
truesize is 128K?

	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
		break;
	}

So if a skb->truesize is 100K, this condition allows two packets before
throttling the third packet.

It's actually hard to account for skb->len, because sk_wmem_alloc
accounts for skb->truesize: I do not want to add another
sk->sk_wbytes_alloc atomic field.
Wei Liu
2013-Sep-06 13:12 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, Sep 06, 2013 at 05:57:48AM -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 11:16 +0100, Wei Liu wrote:
> > Hi Eric
> >
> > I have some questions regarding TSQ and I hope you can shed some light
> > on this.
> >
> > Our observation is that with the default TSQ limit (128K), throughput
> > of the Xen network driver degrades for large packets. That's because we
> > now only have one packet in the queue.
> >
> > I double-checked that skb->len is indeed < 64K. Then I discovered that
> > TSQ actually accounts for skb->truesize, and the packets generated had
> > skb->truesize > 64K, which effectively prevented us from putting two
> > packets in the queue.
> >
> > There seems to be no way to limit skb->truesize inside the driver -- the
> > skb is already constructed by the time it reaches xen-netfront.
>
> What is the skb->truesize value then? It must be huge, and it's clearly
> a problem, because the TCP _receiver_ will also grow its window more
> slowly if the packet is looped back.

It's ~66KB.

> > My questions are:
> > 1) I see the comment in tcp_output.c saying: "TSQ : sk_wmem_alloc
> >    accounts skb truesize, including skb overhead. But thats OK". I
> >    don't quite understand why it is OK.
> > 2) Presumably other drivers will suffer from this as well; is it
> >    possible to account for skb->len instead of skb->truesize?
>
> Well, I have no problem getting line rate on 20Gb with a single flow, so
> other drivers have no problem.

OK, good to know this.

> > 3) If accounting skb->truesize is on purpose, does that mean we only
> >    need to tune that value instead of trying to fix our driver (if
> >    there is a way to)?
>
> The check in TCP allows for at least two packets, unless a single skb
> truesize is 128K?
>
> 	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> 		set_bit(TSQ_THROTTLED, &tp->tsq_flags);
> 		break;
> 	}
>
> So if a skb->truesize is 100K, this condition allows two packets before
> throttling the third packet.

OK. I need to check why we're getting only 1 then. Thanks for your reply.

Wei.

> It's actually hard to account for skb->len, because sk_wmem_alloc
> accounts for skb->truesize: I do not want to add another
> sk->sk_wbytes_alloc atomic field.
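As a rough illustration of why the figure lands near 66KB -- this sketch is not from the thread, and the overhead constants in it are assumptions -- a 64K GSO skb whose payload sits in 4K page fragments is charged a whole page per fragment plus skb metadata, so truesize tracks allocated memory rather than payload bytes:

	/*
	 * Back-of-envelope estimate of skb->truesize for a 64K GSO skb whose
	 * payload lives in 4K page fragments (the xen-netfront case discussed
	 * above). The overhead constants are assumptions for illustration only.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned int page_size     = 4096;               /* PAGE_SIZE */
		unsigned int payload       = 64 * 1024;          /* skb->len upper bound */
		unsigned int nr_frags      = payload / page_size; /* 16 page frags */
		unsigned int skb_overhead  = 256;                /* assumed sizeof(struct sk_buff) */
		unsigned int head_overhead = 1024;               /* assumed linear area + shared info */

		unsigned int truesize = skb_overhead + head_overhead + nr_frags * page_size;

		/* ~66KB: larger than skb->len, which is what the TSQ check compares */
		printf("estimated truesize: %u bytes (len %u)\n", truesize, payload);
		return 0;
	}

The point is that the TSQ condition compares this inflated truesize, not skb->len, against tcp_limit_output_bytes.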
Zoltan Kiss
2013-Sep-06 16:36 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 06/09/13 13:57, Eric Dumazet wrote:
> Well, I have no problem getting line rate on 20Gb with a single flow, so
> other drivers have no problem.

I've made some tests on bare metal:

Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
ixgbe (TSO, GSO on), iperf 2.0.5

Transmitting packets toward the remote end (so running iperf -c on this
host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
(softint percentage in top also increased, from ~3 to ~5%).

So I guess it would be good to revisit the default value of this
setting. What hw did you use, Eric, for your 20Gb results?

Regards,

Zoli
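For anyone wanting to repeat this kind of experiment, the sysctl can be changed by writing the procfs file directly; a minimal C sketch (equivalent to `sysctl -w net.ipv4.tcp_limit_output_bytes=131506`, needs root, and the 131506 value simply mirrors the test above):

	/* Minimal sketch: bump the TSQ limit via procfs, as sysctl -w would. */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		const char *path = "/proc/sys/net/ipv4/tcp_limit_output_bytes";
		FILE *f = fopen(path, "w");

		if (!f) {
			perror("fopen");        /* needs root and a TSQ-capable kernel */
			return EXIT_FAILURE;
		}
		fprintf(f, "%d\n", 131506);     /* default is 131072 (128K) */
		fclose(f);
		return EXIT_SUCCESS;
	}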
Eric Dumazet
2013-Sep-06 16:56 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> On 06/09/13 13:57, Eric Dumazet wrote:
> > Well, I have no problem getting line rate on 20Gb with a single flow, so
> > other drivers have no problem.
>
> I've made some tests on bare metal:
>
> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
> ixgbe (TSO, GSO on), iperf 2.0.5
>
> Transmitting packets toward the remote end (so running iperf -c on this
> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
> When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
> to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
> (softint percentage in top also increased, from ~3 to ~5%).

Typical tradeoff between latency and throughput.

If you favor throughput, then you can increase tcp_limit_output_bytes.

The default is quite reasonable IMHO.

> So I guess it would be good to revisit the default value of this
> setting. What hw did you use, Eric, for your 20Gb results?

Mellanox CX-3

Make sure your NIC doesn't hold TX packets in the TX ring too long before
signaling an interrupt for TX completion.

For example I had to patch mellanox:

commit ecfd2ce1a9d5e6376ff5c00b366345160abdbbb7
Author: Eric Dumazet <edumazet@google.com>
Date:   Mon Nov 5 16:20:42 2012 +0000

    mlx4: change TX coalescing defaults

    mlx4 currently uses a too high tx coalescing setting, deferring
    TX completion interrupts by up to 128 us.

    With the recent skb_orphan() removal in commit 8112ec3b872,
    performance of a single TCP flow is capped to ~4 Gbps, unless
    we increase tcp_limit_output_bytes.

    I suggest using 16 us instead of 128 us, allowing a finer control.

    Performance of a single TCP flow is restored to previous levels,
    while keeping TCP small queues fully enabled with default sysctl.

    This patch is also a BQL prereq.

    Reported-by: Vimalkumar <j.vimal@gmail.com>
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Cc: Yevgeny Petrilin <yevgenyp@mellanox.com>
    Cc: Or Gerlitz <ogerlitz@mellanox.com>
    Acked-by: Amir Vadai <amirv@mellanox.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet
2013-Sep-06 17:00 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> So I guess it would be good to revisit the default value of this
> setting.

If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
can tweak dev->gso_max_size from 65535 to 64000.
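As a sketch of what that tweak would look like from a driver's side -- the function below is a made-up placeholder, not actual ixgbe or xen-netfront code -- drivers normally cap this through the netif_set_gso_max_size() helper:

	#include <linux/netdevice.h>

	/*
	 * Sketch only: cap the size of GSO skbs the stack will build for this
	 * device, so that a 128K TSQ budget fits more than two of them.
	 * my_driver_limit_gso()/my_netdev are placeholders, not real driver code.
	 */
	static void my_driver_limit_gso(struct net_device *my_netdev)
	{
		/* default is ~64K; 64000 leaves headroom for skb overhead */
		netif_set_gso_max_size(my_netdev, 64000);
	}

Smaller GSO skbs mean smaller per-skb truesize, so more of them fit under the same tcp_limit_output_bytes budget.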
Eric Dumazet
2013-Sep-07 17:21 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>
> > So I guess it would be good to revisit the default value of this
> > setting.
>
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.

Another idea would be to no longer use tcp_limit_output_bytes but

	max(sk_pacing_rate / 1000, 2*MSS)

This means that the number of packets in FQ would be limited to the
equivalent of 1 ms, so TCP could respond faster to packet losses:
retransmitted packets would not have to wait for prior packets to be
drained from FQ.

For an 8 Gbps flow (1 Gbyte/s), sk_pacing_rate would be 2 Gbyte/s, and
this would translate to ~2 Mbytes in the Qdisc/TX ring.

sk_pacing_rate was introduced in linux-3.12, but could be backported
easily.
Jason Wang
2013-Sep-09 09:27 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 09/07/2013 12:56 AM, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>> On 06/09/13 13:57, Eric Dumazet wrote:
>>> Well, I have no problem getting line rate on 20Gb with a single flow, so
>>> other drivers have no problem.
>>
>> I've made some tests on bare metal:
>>
>> Dell PE R815, Intel 82599EB 10Gb, 3.11-rc4 32-bit kernel with 3.17.3
>> ixgbe (TSO, GSO on), iperf 2.0.5
>>
>> Transmitting packets toward the remote end (so running iperf -c on this
>> host) can make 8.3 Gbps with the default 128k tcp_limit_output_bytes.
>> When I increased this to 131506 (128k + 434 bytes), it suddenly jumped
>> to 9.4 Gbps. iperf CPU usage also jumped a few percent, from ~36 to ~40%
>> (softint percentage in top also increased, from ~3 to ~5%).
>
> Typical tradeoff between latency and throughput.
>
> If you favor throughput, then you can increase tcp_limit_output_bytes.
>
> The default is quite reasonable IMHO.
>
>> So I guess it would be good to revisit the default value of this
>> setting. What hw did you use, Eric, for your 20Gb results?
>
> Mellanox CX-3
>
> Make sure your NIC doesn't hold TX packets in the TX ring too long before
> signaling an interrupt for TX completion.

Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
packets in the device accurately, and it also can't do BQL. Does this
mean TSQ should be disabled for virtio-net?
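The pattern being described looks roughly like the sketch below (placeholder names, not the actual virtio-net code): skb_orphan() drops the skb's socket reference and the truesize charge on sk_wmem_alloc, so TSQ stops seeing those bytes as outstanding the moment the driver accepts the packet.

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/*
	 * Sketch of an .ndo_start_xmit that orphans the skb up front.
	 * skb_orphan() runs skb->destructor and clears skb->sk, which
	 * releases the truesize charged to sk_wmem_alloc -- so the TSQ
	 * check in tcp_write_xmit() no longer counts these bytes, even
	 * though they still sit in the device's TX ring.
	 */
	static netdev_tx_t my_start_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		skb_orphan(skb);	/* TSQ/byte accounting stops here */

		/* ... hand the buffer to the hypervisor / hardware ring ... */

		return NETDEV_TX_OK;
	}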
Eric Dumazet
2013-Sep-09 13:47 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
> Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
> packets in the device accurately, and it also can't do BQL. Does this
> mean TSQ should be disabled for virtio-net?

If skbs are orphaned, there is no way TSQ can work at all.

It is already disabled, so why do you want to disable it?
Zoltan Kiss
2013-Sep-09 21:41 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 07/09/13 18:21, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
>> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
>>
>>> So I guess it would be good to revisit the default value of this
>>> setting.
>>
>> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
>> can tweak dev->gso_max_size from 65535 to 64000.
>
> Another idea would be to no longer use tcp_limit_output_bytes but
>
> max(sk_pacing_rate / 1000, 2*MSS)

I've tried this on a freshly updated upstream, and it solved my problem
on ixgbe:

-	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {

Now I can get proper line rate. Btw. I've tried to decrease
dev->gso_max_size to 60K or 32K; both were ineffective.

Regards,

Zoli
Eric Dumazet
2013-Sep-09 21:56 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Mon, 2013-09-09 at 22:41 +0100, Zoltan Kiss wrote:
> On 07/09/13 18:21, Eric Dumazet wrote:
> > On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> >> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> >>
> >>> So I guess it would be good to revisit the default value of this
> >>> setting.
> >>
> >> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> >> can tweak dev->gso_max_size from 65535 to 64000.
> >
> > Another idea would be to no longer use tcp_limit_output_bytes but
> >
> > max(sk_pacing_rate / 1000, 2*MSS)
>
> I've tried this on a freshly updated upstream, and it solved my problem
> on ixgbe:
>
> -	if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
> +	if (atomic_read(&sk->sk_wmem_alloc) >= max(sk->sk_pacing_rate / 1000, 2 * mss_now)) {
>
> Now I can get proper line rate. Btw. I've tried to decrease
> dev->gso_max_size to 60K or 32K; both were ineffective.

Yeah, my own test was more like the following:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7c83cb8..07dc77a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1872,7 +1872,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
 		/* TSQ : sk_wmem_alloc accounts skb truesize,
 		 * including skb overhead. But thats OK.
 		 */
-		if (atomic_read(&sk->sk_wmem_alloc) >= sysctl_tcp_limit_output_bytes) {
+		if (atomic_read(&sk->sk_wmem_alloc) >= max(2 * mss_now,
+							   sk->sk_pacing_rate >> 8)) {
 			set_bit(TSQ_THROTTLED, &tp->tsq_flags);
 			break;
 		}

Note that it also seems to make Hystart happier.

I will send patches when all tests are green.
Jason Wang
2013-Sep-10 07:45 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On 09/09/2013 09:47 PM, Eric Dumazet wrote:
> On Mon, 2013-09-09 at 17:27 +0800, Jason Wang wrote:
>
>> Virtio-net orphans the skb in .ndo_start_xmit(), so TSQ cannot throttle
>> packets in the device accurately, and it also can't do BQL. Does this
>> mean TSQ should be disabled for virtio-net?
>
> If skbs are orphaned, there is no way TSQ can work at all.

For example, virtio-net will stop the tx queue when it finds the tx
queue may be full, and enable the queue again when some packets have
been sent. In this case TSQ works and throttles the total bytes queued
in the qdisc. This usually happens during heavy network load, such as
two sessions of netperf.

> It is already disabled, so why do you want to disable it?

We noticed a regression, and bisect shows it was introduced by TSQ.
Eric Dumazet
2013-Sep-10 12:35 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Tue, 2013-09-10 at 15:45 +0800, Jason Wang wrote:
> For example, virtio-net will stop the tx queue when it finds the tx
> queue may be full, and enable the queue again when some packets have
> been sent. In this case TSQ works and throttles the total bytes queued
> in the qdisc. This usually happens during heavy network load, such as
> two sessions of netperf.

You told me skbs were _orphaned_. This automatically _disables_ TSQ
after packets leave the Qdisc.

So you have a problem because your skb orphaning only takes effect when
packets leave the Qdisc. If you can't afford sockets being throttled,
make sure you have no Qdisc!

> We noticed a regression, and bisect shows it was introduced by TSQ.

You do realize TSQ is a balance between throughput and latencies?

In the case of TSQ, it was very clear that limiting the amount of
outstanding bytes in queues could have an impact on bandwidth.

Pushing megabytes of TCP packets with identical TCP timestamps is bad,
because it prevents us from doing delay-based congestion control, and a
single flow could fill the Qdisc with a thousand packets. (Self-induced
delays; see the BufferBloat discussions.)

One known problem in the TCP stack is that sendmsg() locks the socket
for the duration of the call. sendpage() does not have this problem.
tcp_tsq_handler() is deferred if tcp_tasklet_func() finds a locked
socket. The owner of the socket will call tcp_tsq_handler() when the
socket is released.

So if you use sendmsg() with large buffers, or if copying in data from
user land involves page faults, it may explain why you need a larger
number of in-flight bytes to sustain a given throughput.

You could take a look at commit c9bee3b7fdecb0c1d070c ("tcp:
TCP_NOTSENT_LOWAT socket option"), and play with
/proc/sys/net/ipv4/tcp_notsent_lowat, to force sendmsg() to release the
socket lock every few hundred kbytes.
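For reference, TCP_NOTSENT_LOWAT can also be set per socket with setsockopt(); a minimal userspace sketch, where the 128KB threshold is just an example value:

	/*
	 * Sketch: cap the amount of not-yet-sent data sendmsg() can queue on
	 * one socket, so the socket lock is released more often. Requires a
	 * kernel with commit c9bee3b7fdecb0 ("tcp: TCP_NOTSENT_LOWAT socket
	 * option"), i.e. linux-3.12+.
	 */
	#include <stdio.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	#ifndef TCP_NOTSENT_LOWAT
	#define TCP_NOTSENT_LOWAT 25	/* from include/uapi/linux/tcp.h */
	#endif

	int main(void)
	{
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		int lowat = 128 * 1024;		/* example threshold: 128 KB unsent */

		if (fd < 0 || setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
					 &lowat, sizeof(lowat)) < 0) {
			perror("TCP_NOTSENT_LOWAT");
			return 1;
		}
		printf("not-sent low watermark set to %d bytes\n", lowat);
		return 0;
	}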
Cong Wang
2013-Sep-21 03:00 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>
> Yeah, my own test was more like the following
>
...
>
> Note that it also seems to make Hystart happier.
>
> I will send patches when all tests are green.

How is this going? I don't see any patch posted to netdev.

Thanks!
Wei Liu
2013-Sep-21 15:03 UTC
Re: TSQ accounting skb->truesize degrades throughput for large packets
On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
> >
> > Yeah, my own test was more like the following
> >
> ...
> >
> > Note that it also seems to make Hystart happier.
> >
> > I will send patches when all tests are green.
>
> How is this going? I don't see any patch posted to netdev.

I'm afraid you forgot to CC any relevant people in this email. :-)

Wei.

> Thanks!
Cong Wang
2013-Sep-22 02:36 UTC
Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
On Sat, Sep 21, 2013 at 11:03 PM, Wei Liu <wei.liu2@citrix.com> wrote:
> On Sat, Sep 21, 2013 at 03:00:26AM +0000, Cong Wang wrote:
>> Eric Dumazet <eric.dumazet <at> gmail.com> writes:
>> >
>> > Yeah, my own test was more like the following
>> >
>> ...
>> >
>> > Note that it also seems to make Hystart happier.
>> >
>> > I will send patches when all tests are green.
>>
>> How is this going? I don't see any patch posted to netdev.
>
> I'm afraid you forgot to CC any relevant people in this email. :-)

I was replying via newsgroup, not a mailing list. :)

Anyway, adding Eric and netdev now.
Eric Dumazet
2013-Sep-22 14:58 UTC
Re: [Xen-devel] TSQ accounting skb->truesize degrades throughput for large packets
On Sun, 2013-09-22 at 10:36 +0800, Cong Wang wrote:
>
> I was replying via newsgroup, not a mailing list. :)
>
> Anyway, adding Eric and netdev now.

Yes, don't worry, this will be done on Monday or Tuesday. I am still in
New Orleans after LPC 2013.