zhihui Chen
2009-Mar-31 14:27 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
In crossbow, each mac_srs has a kernel thread called "mac_rx_srs_poll_ring" to poll the hardware, and crossbow wakes up this thread automatically to pull packets from the hardware. Does crossbow provide any way to disable the polling mechanism, for example by disabling this kernel thread?

Thanks
Zhihui
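For context, one way to confirm that these per-SRS poll threads exist on a running system is to look for their stack frames with mdb. This is only a sketch: only threads that happen to be parked in mac_rx_srs_poll_ring at that moment will match the grep, so the count is a lower bound.

% echo "::threadlist -v" | mdb -k | grep -c mac_rx_srs_poll_ring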
rajagopal kunhappan
2009-Mar-31 17:14 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
Hi Zhihui,

> In crossbow, each mac_srs has a kernel thread called "mac_rx_srs_poll_ring"
> to poll the hardware, and crossbow wakes up this thread automatically to pull
> packets from the hardware. Does crossbow provide any way to disable the
> polling mechanism, for example by disabling this kernel thread?

Presently no. Can we know why you would want to do that?

Thanks,
-krgopi
zhihui Chen
2009-Apr-01 03:45 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
During my tests of 10GbE (Intel ixgbe) with snv_110, I find that context switching is a big problem for performance.

Benchmark: Netperf-2.4.4
Workload: TCP_STREAM (sending 8KB TCP packets from the SUT to a remote machine)

Crossbow uses two kernel threads (mac_soft_ring_worker and mac_rx_srs_poll_ring) to help send and receive packets in the kernel. On a multi-core or multi-processor system, these two threads and the NIC interrupt can run on different CPUs. Consider the following scenario on my 4-core system:

mac_soft_ring_worker: CPU 1
mac_rx_srs_poll_ring: CPU 1
Interrupt: CPU 2

I run the workload and bind the application to the free CPU 0. I get about 8.8Gbps and mpstat output like the following:

CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    17     3     21    9    0 1093   0 134501   3  97   0   0
  1    0   0    0    29    13  56972    2    7  992   0      2   0  50   0  50
  2   14   0    0 19473 19455     37    0    8    0   0    149   0  28   0  72
  3    0   0    1   305   104    129    0    4    1   0      9   0   0   0 100
CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    14     2      2    7    0 1120   0 133511   3  97   0   0
  1    0   0    0    24    12  54501    2    6  971   0      2   0  48   0  52
  2    0   0    0 19668 19648     45    0    9    0   0    149   0  28   0  72
  3    0   0    0   306   104    128    0    6    0   0     11   0   0   0 100
CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    14     2     21    8    2 1107   0 134569   3  97   0   0
  1    0   0    0    32    16  57522    2    6  928   0      2   0  50   0  50
  2    0   0    0 19564 19542     46    0   10    1   0    140   0  28   0  72
  3    0   0    0   306   104    122    0    7    0   0     58   0   0   0 100

Next, I simply run one dtrace command:

% dtrace -n 'mac_tx:entry{@[probefunc,stack()] = count();}'

With that running, I get 9.57Gbps and mpstat output like the following:

CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    23     5   2055    9    4  529   0 142719   3  81   0  15
  1    0   0    1    21     8  24343    0    2 2523   0      0   0  88   0  12
  2   14   0    5 19678 19537     81    0    5    0   0    150   0  43   0  57
  3    0   0    6   308   104     93    0    5    2   0    278   0   0   0 100
CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    2    19     4   1998    9    6  556   0 142911   3  82   0  16
  1    0   0    0    20     8  23543    1    2 2556   0      0   0  88   0  12
  2    0   0    6 19647 19499    106    0    8    1   0    266   0  43   0  57
  3    0   0    2   308   104     70    0    5    1   0     28   0   0   0 100
CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    21     3   1968   10    4  556   0 144547   3  82   0  15
  1    0   0    0    20    10  23334    0    2 2622   0      0   0  90   0  10
  2    0   0    9 19797 19658     92    0   10    1   0    274   0  44   0  56
  3    0   0    0   307   104     95    0    6    2   0    182   0   0   0  99

I don't think dtrace itself can improve NIC performance. Comparing the mpstat output, the biggest difference is that the context-switch rate drops sharply, from about 55000 to 23000 per second. This leads to my point that too many context switches hinder crossbow's performance. If I make these two kernel threads and the interrupt run on entirely different cores, performance drops to about 7.8Gbps while context switches rise to about 80000 per second, though interrupts stay at about 19500/s.

In crossbow, the thread mac_soft_ring_worker is woken up by mac_rx_srs_poll_ring and by the interrupt, via calls to mac_soft_ring_worker_wakeup. My thinking is that if I can disable the polling function, the context switches should be reduced.

Thanks
Zhihui
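For reference, a run like the one described above can be reproduced roughly as follows. The netperf options (test length, 8KB message size) and the use of pbind to pin the sender to CPU 0 are assumptions on my part, not details taken from the original post:

% netperf -H <receiver> -t TCP_STREAM -l 60 -- -m 8192 &
% pbind -b 0 `pgrep -x netperf`     # bind the sending application to CPU 0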
Sunay Tripathi
2009-Apr-01 04:45 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
Sure. Just run this and polling will get disabled:

% dladm create-vnic -l ixgbe0 vnic1

Let us know what you get with polling disabled. We don't have a tunable to disable polling, but since ixgbe can't assign rx rings to a VNIC yet, this disables polling for the primary NIC as well.

Cheers,
Sunay
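As a side note, the VNIC in this workaround is only a placeholder; no traffic has to flow over it. A sketch of creating it, checking it, and removing it afterwards (the assumption that deleting the VNIC restores the previous polling behaviour is mine, not stated in the thread):

% dladm create-vnic -l ixgbe0 vnic1    # side effect: polling is disabled on ixgbe0
% dladm show-vnic                      # confirm the placeholder VNIC exists
% dladm delete-vnic vnic1              # remove it when testing is done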
zhihui Chen
2009-Apr-01 08:37 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
Thanks, I have tried your method and the poll function is now disabled. After that, the context switches decrease a lot, but the performance still stays at 8.8Gbps. Mpstat output looks like the following:

CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    33     1     15    8    0 1366   0 134084   3  97   0   0
  1   10   0    0    56    16  35208    6    8 1736   0      9   0  48   0  52
  2    4   0   37 19646 19618     58    0   10  408   0    154   0  34   0  66
  3    0   0    0   308   107    118    0    6    1   0    185   0   0   0 100

Then I run the same dtrace command again: performance improves to 9.5Gbps and context switches also drop from 35000 to 6300. Mpstat output looks like the following:

CPU minf mjf xcal  intr  ithr    csw icsw migr smtx srw  syscl usr sys  wt idl
  0    0   0    0    17     3   1858    8    3  605   0 142000   3  81   0  16
  1    0   0    0    15     6   4472    0    2 2516   0      0   0  92   0   8
  2    0   0    0 19740 19679    126    0    6  253   0    289   0  46   0  54
  3    0   0    9   509   208     66    0    4    0   0    272   0   0   0 100

Because this is a TX-heavy workload, maybe we should care more about the thread mac_soft_ring_worker?

Thanks
Zhihui
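One way to see who is waking the worker thread in this TX-heavy case is to count callers of mac_soft_ring_worker_wakeup by kernel stack. This is only a sketch; it assumes the fbt provider exposes that function (i.e. it has not been inlined on this build), and the 10-second window is arbitrary:

% dtrace -n 'fbt::mac_soft_ring_worker_wakeup:entry { @[stack()] = count(); } tick-10s { exit(0); }'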
zhihui Chen
2009-Apr-01 09:31 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
We know that mac_tx has two call paths. One is the syscall path:

dld`str_mdata_fastpath_put+0xa4
ip`tcp_send_data+0x8b4
ip`tcp_output+0x7ea
ip`squeue_enter+0x330
ip`tcp_sendmsg+0xfb
sockfs`so_sendmsg+0x1c7
sockfs`socket_sendmsg+0x61
sockfs`sendit+0x167
sockfs`send+0x78
sockfs`send32+0x22
unix`_sys_sysenter_post_swapgs+0x14b

The other is the worker-thread path:

dld`str_mdata_fastpath_put+0xa4
ip`tcp_send_data+0x8b4
ip`tcp_send+0xb01
ip`tcp_wput_data+0x721
ip`tcp_rput_data+0x33d1
ip`squeue_drain+0x179
ip`squeue_enter+0x3f4
ip`ip_input+0xc17
mac`mac_rx_soft_ring_drain+0xdf
mac`mac_soft_ring_worker+0x111
unix`thread_start+0x8

I counted how often each mac_tx call path was taken over 10 seconds from another terminal. Before using the dtrace command to improve performance, the call-path distribution is:

Syscall path:        641615
Worker-thread path:  482210

After using the dtrace command to improve performance, the call-path distribution is:

Syscall path:        319273
Worker-thread path: 1061620

Thanks
Zhihui
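Counts like the ones above can be gathered with a one-liner along these lines (a sketch: the tick-10s probe simply stops collection after ten seconds, and the two paths are then separated afterwards by looking for squeue_drain versus the syscall frames in the reported stacks):

% dtrace -n 'mac_tx:entry { @[stack()] = count(); } tick-10s { exit(0); }'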
Sunay Tripathi
2009-Apr-01 19:13 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
Zhihui,

Where did you disable polling? On the sender or the receiver? Because, like you, I am getting 8.8 Gbps on the receiver, but if I disable polling on the receiver, I drop to under 2 Gbps!

So clearly polling is necessary on the receive side. Now, before we go into the transmit side and the effect of polling when LSO is in use, can you verify that the issue you were seeing with mac_rx_srs_poll running at 100% even without load is resolved? If not, was that on the receiver or the sender?

I am somewhat confused about what we are trying to debug (receiver or sender).

Thanks,
Sunay

PS: I am looking at your other email as well, and I can explain why you see this behaviour on transmit and explore ways of fixing it, but I need to clearly understand what we are doing first.
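A quick way to check whether a poll thread is spinning without load is to sample kernel stacks for a few seconds and look for mac_rx_srs_poll_ring near the top of the hottest ones. A sketch; the 997Hz sampling rate, the 10-second window, and keeping only the top ten stacks are arbitrary choices:

% dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-10s { trunc(@, 10); exit(0); }'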
zhihui Chen
2009-Apr-02 01:59 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
The workload is a TX workload and all tuning is done on the TX side. The RX side is a more powerful system without any tuning.

The issue with mac_rx_srs_poll running at 100% without load just happens randomly. In another mail, Eric Cheng said that this is bug 681973 and will be fixed in b112. This issue also happens on the TX side.

Currently I am trying to solve the send performance issue. On the same system, receive performance can reach 9.8Gbps (MTU 9000), which looks very good from a throughput perspective.

Thanks
Zhihui
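For completeness, the jumbo-frame MTU mentioned here can be checked per link with dladm; the link name ixgbe0 is assumed from earlier in the thread:

% dladm show-linkprop -p mtu ixgbe0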
Sunay Tripathi
2009-Apr-03 07:53 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
OK, now I understand much better. So we are debugging the send side over a TCP workload, and the only thing being processed on receive is ACKs. Based on the other two stack traces you sent, it seems that Tx performance is higher when outbound packets are processed by the squeue worker.

Given that you have LSO working on transmit, you probably want to slow down the ACK processing a bit. I suspect that when you run dtrace with probes in the data path, it slows things down just enough to allow LSO to build bigger chunks. Can you rerun the same experiment (with dtrace and without) and look at the delta of the lso counters in TCP? Use 'kstat tcp | grep lso' before and after in each case.

FWIW, I have a similar setup but with multiple clients. Transmitting to one client gives lower results, but when I transmit to 2 or more clients, I consistently get 9.3+ Gbps (I used uperf to send multiple streams with the default MTU of 1500):

Txn1   5.49GB/5.05(s) = 9.34Gb/s   142521txn/s   7.02us/txn

So normally, when you load up the system, performance gets better. What you are facing is a micro-benchmark issue. Perhaps drop the MTU size to 1500 and try.

Cheers,
Sunay
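A simple way to take the delta Sunay asks for is to snapshot the counters around the run; this is a sketch, and the exact names of the lso statistics printed by 'kstat tcp' may vary between builds:

% kstat tcp | grep lso > lso.before
% # run the netperf test here
% kstat tcp | grep lso > lso.after
% diff lso.before lso.after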
zhihui Chen
2009-Apr-04 12:45 UTC
[crossbow-discuss] How to disable the polling function of mac_srs
2009/4/3 Sunay Tripathi <Sunay.Tripathi at sun.com>:

> OK, now I understand much better. So we are debugging the send side over
> a TCP workload, and the only thing being processed on receive is ACKs.
> Based on the other two stack traces you sent, it seems that Tx performance
> is higher when outbound packets are processed by the squeue worker.

Yes. After the dtrace command is executed, CPU utilization on the core running the worker thread increases from about 40% to 80%, and more outbound packets are handled by the worker thread.

> Given that you have LSO working on transmit, you probably want to slow
> down the ACK processing a bit. I suspect that when you run dtrace with
> probes in the data path, it slows things down just enough to allow LSO to
> build bigger chunks. Can you rerun the same experiment (with dtrace and
> without) and look at the delta of the lso counters in TCP? Use
> 'kstat tcp | grep lso' before and after in each case.

Checked: the number of outbound packets going through LSO is very small and can be ignored.

> FWIW, I have a similar setup but with multiple clients. Transmitting to
> one client gives lower results, but when I transmit to 2 or more clients,
> I consistently get 9.3+ Gbps (I used uperf to send multiple streams with
> the default MTU of 1500).
>
> So normally, when you load up the system, performance gets better. What
> you are facing is a micro-benchmark issue. Perhaps drop the MTU size to
> 1500 and try.

Yes, fully agree. I have tested with 15 clients (each client uses a 1Gb NIC) and the MTU set to 1500; the TCP send throughput can reach 9.5Gbps.

Thanks
Zhihui
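A multi-client send test like the one described can be scripted roughly as follows; the client host names, test length, and message size are placeholders of mine, not values from the thread:

% for h in client1 client2 client3; do netperf -H $h -t TCP_STREAM -l 60 -- -m 8192 & done; wait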