Santos, Jose Renato G (Jose Renato Santos)
2005-Apr-05 02:07 UTC
RE: [Xen-devel] MPI benchmark performance gap between native linux and domU
Hi,

We had a similar network problem in the past. We were using a TCP benchmark instead of MPI, but I believe your problem is probably the same as the one we encountered. It took us a while to get to the bottom of this, and we only identified the reason for this behavior after we ported oprofile to Xen and did some performance profiling experiments.

Here is a brief explanation of the problem we found and the solution that worked for us. Xenolinux allocates a full page (4KB) to store each socket buffer instead of using just MTU bytes as in traditional Linux. This is necessary to enable page exchanges between the guest and the I/O domains. The side effect of this is that memory space used for socket buffers is not very efficient. Even if packets have the maximum MTU size (typically 1500 bytes for Ethernet), the total buffer utilization is very low (at most just slightly higher than 35%). If packets arrive faster than they are processed at the receiver side, they will exhaust the receive buffer before the TCP advertised window is reached. (By default Linux uses a TCP advertised window equal to 75% of the receive buffer size. In standard Linux this is typically sufficient to stop packet transmission at the sender before running out of receive buffers; the same is not true in Xen due to the inefficient use of socket buffers.) When a packet arrives and there is no receive buffer available, TCP tries to free socket buffer space by eliminating socket buffer fragmentation (i.e. eliminating wasted buffer space). This is done at the cost of an extra copy of the whole receive buffer into new, compacted socket buffers. This introduces overhead and reduces throughput when the CPU is the bottleneck, which seems to be your case.

This problem is not very frequent because modern CPUs are fast enough to receive packets at Gigabit speeds and the receive buffer does not fill up. However, the problem may arise when using slower machines and/or when the workload consumes a lot of CPU cycles, such as scientific MPI applications. In your case you have both factors against you.

The solution to this problem is trivial: you just have to change the TCP advertised window of your guest to a lower value. In our case we used 25% of the receive buffer size, and that was sufficient to eliminate the problem. This can be done using the following command:

  echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

(The default 2 corresponds to 75% of the receive buffer, and -2 corresponds to 25%.)

Please let me know if this improves your results. You should still see a degradation in throughput when comparing Xen to traditional Linux, but hopefully you will see better throughput. You should also try running your experiments in domain 0; this will give better throughput, although still lower than traditional Linux. I am curious to know if this has any effect on your experiments. Please post the new results if it does.

Thanks

Renato
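For readers who have not met this sysctl before, the arithmetic behind the 75%/25% figures can be sketched as follows: with a positive tcp_adv_win_scale the kernel reserves bytes/2^scale of the receive buffer as application overhead, and with a non-positive value it reserves bytes - bytes/2^(-scale), leaving the remainder as the advertised window. The 256KB buffer size below is purely illustrative, not a value from these experiments (a small bash sketch, not part of the original message):

  # Illustration only: how tcp_adv_win_scale splits an assumed 256KB receive
  # buffer between application overhead and the advertised TCP window.
  rmem=262144
  for scale in 2 -2; do
      if [ "$scale" -gt 0 ]; then
          overhead=$(( rmem / (1 << scale) ))          # bytes / 2^scale
      else
          overhead=$(( rmem - rmem / (1 << -scale) ))  # bytes - bytes / 2^(-scale)
      fi
      window=$(( rmem - overhead ))
      echo "tcp_adv_win_scale=$scale: advertised window $window bytes ($(( 100 * window / rmem ))% of the buffer)"
  done
  # prints 196608 bytes (75%) for scale 2 and 65536 bytes (25%) for scale -2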
> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of xuehai zhang
> Sent: Monday, April 04, 2005 4:19 PM
> To: Xen-devel@lists.xensource.com
> Subject: [Xen-devel] MPI benchmark performance gap between native linux and domU
>
> Hi all,
>
> I did the following experiments to explore MPI application execution
> performance both on native Linux machines and inside unprivileged Xen user
> domains. I use 8 machines with identical HW configurations (498.756 MHz dual
> CPU, 512MB memory, on a 10MB/sec LAN) and I use the Pallas MPI Benchmarks (PMB).
>
> Experiment 1: I boot all 8 nodes with native Linux (nosmp, kernel 2.4.29)
> and use all of them for PMB tests.
>
> Experiment 2: I boot all 8 nodes with Xen running and start a single user
> domain (port 2.6.10, using a file-backed VBD) on each node with 360MB memory.
> Then I run the same PMB tests among these 8 user domains.
>
> The experiment results show that running the same MPI benchmark in user
> domains usually results in worse (sometimes very bad) performance compared
> with native Linux machines. The following are the results for the PMB
> SendRecv benchmark for both experiments (Table 1 and Table 2 report
> throughput and latency respectively). As you may notice, SendRecv can
> achieve 14.9 MB/sec throughput on native Linux machines but at most
> 7.07 MB/sec when running inside user domains. The latency results also
> show a big gap.
>
> Clearly, there is a difference between the memory used in the native Linux
> machines of Experiment 1 (512MB) and in the user domains (360MB; it cannot
> go higher because dom0 started with 128MB memory) of Experiment 2. However,
> I don't think it is the main cause of the performance gap because the tested
> message sizes are much smaller than both memory sizes.
>
> I will appreciate your help if you have had similar experience and want to
> share your insights.
>
> BTW, if you are not familiar with the PMB SendRecv benchmark, you can find
> a detailed explanation at http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf
> (see section 4.3.1).
>
> Thanks in advance for your help.
>
> Xuehai
>
> P.S. Table 1: SendRecv throughput (MB/sec) performance
>
> Message_Size(bytes)  Experiment_1  Experiment_2
> 0                    0             0
> 1                    0             0
> 2                    0             0
> 4                    0             0
> 8                    0.04          0.01
> 16                   0.16          0.01
> 32                   0.34          0.02
> 64                   0.65          0.04
> 128                  1.17          0.09
> 256                  2.15          0.59
> 512                  3.4           1.23
> 1K                   5.29          2.57
> 2K                   7.68          3.5
> 4K                   10.7          4.96
> 8K                   13.35         7.07
> 16K                  14.9          3.77
> 32K                  9.85          3.68
> 64K                  5.06          3.02
> 128K                 7.91          4.94
> 256K                 7.85          5.25
> 512K                 7.93          6.11
> 1M                   7.85          6.5
> 2M                   8.18          5.44
> 4M                   7.55          4.93
>
> Table 2: SendRecv latency (millisec) performance
>
> Message_Size(bytes)  Experiment_1  Experiment_2
> 0                    1979.6        3010.96
> 1                    1724.16       3218.88
> 2                    1669.65       3185.3
> 4                    1637.26       3055.67
> 8                    406.77        2966.17
> 16                   185.76        2777.89
> 32                   181.06        2791.06
> 64                   189.12        2940.82
> 128                  210.51        2716.3
> 256                  227.36        843.94
> 512                  287.28        796.71
> 1K                   368.72        758.19
> 2K                   508.65        1144.24
> 4K                   730.59        1612.66
> 8K                   1170.22       2471.65
> 16K                  2096.86       8300.18
> 32K                  6340.45       17017.99
> 64K                  24640.78      41264.5
> 128K                 31709.09      50608.97
> 256K                 63680.67      94918.13
> 512K                 125531.7      162168.47
> 1M                   251566.94     321451.02
> 2M                   477431.32     707981
> 4M                   997768.35     1503987.61
Keir Fraser
2005-Apr-05 08:59 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
On 5 Apr 2005, at 03:07, Santos, Jose Renato G (Jose Renato Santos) wrote:

> Here is a brief explanation of the problem we found and the solution
> that worked for us. Xenolinux allocates a full page (4KB) to store socket
> buffers instead of using just MTU bytes as in traditional linux. This is
> necessary to enable page exchanges between the guest and the I/O domains.
> The side effect of this is that memory space used for socket buffers is
> not very efficient.

This is true, but these days we lie to the network stack about how big the skb data area is. The 'truesize' field, which is what I think is used for socket buffer accounting, will be around 1600 bytes, not 4096. So I would expect the old trick of reducing the receive window not to work: but if it does then that is very interesting!

 -- Keir
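To make the accounting difference concrete: what matters for when the stack starts pruning is how many bytes each received skb is charged against the socket's receive budget. The sketch below simply divides an assumed receive buffer by the two per-packet charges under discussion; the 87380-byte buffer is an assumed figure for illustration, not one measured in this thread.

  # Illustration only: MTU-sized frames accepted before the receive budget is
  # spent, under the two per-skb charges discussed above.
  rmem=87380
  echo "charged a full 4096-byte page per skb: $(( rmem / 4096 )) frames"
  echo "charged ~1600 bytes of truesize:       $(( rmem / 1600 )) frames"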
Nivedita Singhvi
2005-Apr-05 22:22 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
Santos, Jose Renato G (Jose Renato Santos) wrote:

> Hi,
>
> We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
> It took us a while to get to the bottom of this and we only identified
> the reason for this behavior after we ported oprofile to Xen and did
> some performance profiling experiments.

Hello! Was this on the 2.6 kernel? Would you be able to share the oprofile port? It would be very handy indeed right now. (I was told by a few people that someone was porting oprofile, and I believe some status on it went by on the list, but I haven't seen it yet...)

> Here is a brief explanation of the problem we found and the solution
> that worked for us.
> Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect of this is that memory space used for socket buffers is not very
> efficient. Even if packets have the maximum MTU size (typically 1500
> bytes for Ethernet) the total buffer utilization is very low (at most
> just slightly higher than 35%). If packets arrive faster than they are
> processed at the receiver side, they will exhaust the receive buffer

Most small connections (say up to 3-4K) involve only 3 to 5 segments, and so the TCP window never really opens fully. On longer-lived connections it does help very much to have a large buffer.

> before the TCP advertised window is reached (By default Linux uses a TCP
> advertised window equal to 75% of the receive buffer size. In standard
> Linux, this is typically sufficient to stop packet transmission at the
> sender before running out of receive buffers. The same is not true in
> Xen due to inefficient use of socket buffers). When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of the
> whole receive buffer into new, compacted socket buffers. This introduces
> overhead and reduces throughput when the CPU is the bottleneck, which
> seems to be your case.

/proc/net/netstat will show a counter of just how many times this happens (RcvPruned). It would be interesting if that were significant (a quick way to read these counters is sketched after this message).

> This problem is not very frequent because modern CPUs are fast enough to
> receive packets at Gigabit speeds and the receive buffer does not fill
> up. However the problem may arise when using slower machines and/or when
> the workload consumes a lot of CPU cycles, such as scientific MPI
> applications. In your case you have both factors against you.
>
> The solution to this problem is trivial. You just have to change the TCP
> advertised window of your guest to a lower value. In our case, we used
> 25% of the receive buffer size and that was sufficient to eliminate the
> problem. This can be done using the following command
>
>> echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

How much did this improve your results by? And wouldn't making the default socket buffers and max socket buffers larger by, say, 5 times be more effective (other than for those applications that already use setsockopt() to set their buffers to some size, but not a large enough one)?

> (The default 2 corresponds to 75% of receive buffer, and -2 corresponds
> to 25%)
>
> Please let me know if this improves your results.
> You should still see a degradation in throughput when comparing Xen to
> traditional Linux, but hopefully you will see better throughput. You
> should also try running your experiments in domain 0; this will give
> better throughput, although still lower than traditional Linux.
> I am curious to know if this has any effect on your experiments.
> Please post the new results if it does.

Yep, me too..

thanks,
Nivedita
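The counters Nivedita refers to live on the TcpExt line of /proc/net/netstat. One minimal way to pull out the prune-related ones, assuming the stock 2.4/2.6 layout of that file where a header line and a value line share the TcpExt prefix (a sketch, not from the original message):

  # Print the prune-related TcpExt counters (e.g. PruneCalled, RcvPruned,
  # OfoPruned) by pairing the TcpExt header line with the TcpExt value line.
  grep '^TcpExt:' /proc/net/netstat | \
      awk 'NR==1 { for (i = 1; i <= NF; i++) name[i] = $i }
           NR==2 { for (i = 1; i <= NF; i++) if (name[i] ~ /Prune/) print name[i], $i }'

Running this before and after a benchmark and comparing the values shows whether receive-buffer pruning is actually happening during the run.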
xuehai zhang
2005-Apr-05 22:23 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
Hi Ian and Jose,

Based on your suggestions, I did two more experiments: one (tagged "domU-B" in the tables below) changes the TCP advertised window scale of domU to -2 (the default is 2), and the other (tagged "dom0") repeats the experiment in dom0 (with only dom0 running). The tables below contain the results from these two new experiments plus the two old ones (tagged "native-linux" and "domU-A") from my previous email.

I have the following observations from the results:

1. Decreasing the TCP window scaling ("domU-B") does not help performance; it actually slows things down slightly compared with "domU-A".

2. Generally, the performance of running the experiments in dom0 ("dom0" column) is very close to (slightly less than) the performance on native Linux ("native-linux" column). However, in certain situations it outperforms native Linux, for example the throughput value at a 64KB message size and the latency values at message sizes of 1, 2, 4, and 8 bytes.

3. The performance gap between domU and dom0 is big, similar to the gap between domU and native Linux.

BTW, each reported data point in the tables below is an average over 10 runs of the same experiment. I forgot to mention that in the experiments using user domains, the 8 domUs form a private network and each domU is assigned a private IP address (for example, 192.168.254.X).

Xuehai

*********************************
*SendRecv Throughput(Mbytes/sec)*
*********************************
Msg Size(bytes)  native-linux  dom0        domU-A     domU-B
0                0             0.00        0          0.00
1                0             0.01        0          0.00
2                0             0.01        0          0.00
4                0             0.03        0          0.00
8                0.04          0.05        0.01       0.01
16               0.16          0.11        0.01       0.01
32               0.34          0.21        0.02       0.02
64               0.65          0.42        0.04       0.04
128              1.17          0.79        0.09       0.10
256              2.15          1.44        0.59       0.58
512              3.4           2.39        1.23       1.22
1024             5.29          3.79        2.57       2.50
2048             7.68          5.30        3.5        3.44
4096             10.7          8.51        4.96       5.23
8192             13.35         11.06       7.07       6.00
16384            14.9          13.60       3.77       4.62
32768            9.85          11.13       3.68       4.34
65536            5.06          9.06        3.02       3.14
131072           7.91          7.61        4.94       5.04
262144           7.85          7.65        5.25       5.29
524288           7.93          7.77        6.11       5.40
1048576          7.85          7.82        6.5        5.62
2097152          8.18          7.35        5.44       5.32
4194304          7.55          6.88        4.93       4.92

*********************************
*   SendRecv Latency(millisec)  *
*********************************
Msg Size(bytes)  native-linux  dom0        domU-A      domU-B
0                1979.6        1920.83     3010.96     3246.71
1                1724.16       397.27      3218.88     3219.63
2                1669.65       297.58      3185.3      3298.86
4                1637.26       285.27      3055.67     3222.34
8                406.77        282.78      2966.17     3001.24
16               185.76        283.87      2777.89     2761.90
32               181.06        284.75      2791.06     2798.77
64               189.12        293.93      2940.82     3043.55
128              210.51        310.47      2716.3      2495.83
256              227.36        338.13      843.94      853.86
512              287.28        408.14      796.71      805.51
1024             368.72        515.59      758.19      786.67
2048             508.65        737.12      1144.24     1150.66
4096             730.59        917.97      1612.66     1516.35
8192             1170.22       1411.94     2471.65     2650.17
16384            2096.86       2297.19     8300.18     6857.13
32768            6340.45       5619.56     17017.99    14392.36
65536            24640.78      13787.31    41264.5     39871.19
131072           31709.09      32797.52    50608.97    49533.68
262144           63680.67      65174.67    94918.13    94157.30
524288           125531.7      128116.73   162168.47   189307.05
1048576          251566.94     252257.55   321451.02   361714.44
2097152          477431.32     527432.60   707981      728504.38
4194304          997768.35     1108898.61  1503987.61  1534795.56

Santos, Jose Renato G (Jose Renato Santos) wrote:

> Hi,
>
> We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI but I believe your problem is probably the same
> as the one we encountered.
> [rest of quoted message, including the re-quoted original benchmark
> description and tables, snipped; see earlier in the thread]
xuehai zhang
2005-Apr-05 22:34 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
Santos, Jose Renato G (Jose Renato Santos) wrote:

> [explanation of the Xen socket-buffer issue snipped]
>
>> echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale

In my experiments, I notice the above change doesn't persist across reboots (every reboot resets the value back to 2, the default value for Debian Sarge 3.1). Is there a way to make the change permanent?

Thanks.

Xuehai

> [rest of quoted message snipped]
Nivedita Singhvi
2005-Apr-05 22:53 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
xuehai zhang wrote:

>>> echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>
> In my experiments, I notice the above change doesn't persist across
> reboots (every reboot resets the value back to 2, the default value for
> Debian Sarge 3.1). Is there a way to make the change permanent?

You can edit /etc/sysctl.conf and add the following entry:

  net.ipv4.tcp_adv_win_scale = -2

Or you can put a

  sysctl -w net.ipv4.tcp_adv_win_scale -2

into some appropriate /etc/init.d startup script.

thanks,
Nivedita
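For completeness, an /etc/sysctl.conf entry can also be applied immediately and verified without rebooting; a sketch, assuming a stock sysctl installation (exactly how sysctl.conf is read at boot is distribution-dependent):

  # append the setting, reload /etc/sysctl.conf, and confirm the running value
  echo 'net.ipv4.tcp_adv_win_scale = -2' >> /etc/sysctl.conf
  sysctl -p                                # re-reads /etc/sysctl.conf
  sysctl net.ipv4.tcp_adv_win_scale        # should now report -2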
Nivedita Singhvi
2005-Apr-05 22:58 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
Nivedita Singhvi wrote:

> Or you can put a
>
>   sysctl -w net.ipv4.tcp_adv_win_scale -2

grrr... that should be

  sysctl -w net.ipv4.tcp_adv_win_scale=-2

thanks,
Nivedita