Santos, Jose Renato G (Jose Renato Santos)
2005-Apr-05 03:10 UTC
RE: [Xen-devel] MPI benchmark performance gap between native linux and domU
I guess I overlooked the rates you reported in your post. Looking at
them carefully now, I am somewhat confused. When you say MB/sec, do you
mean Megabyte/sec or Megabit/sec? In any case, these are much lower
rates than in our case (we were using a gigabit network). I am starting
to think that your problem might be different from ours, but it does
not hurt to try changing the advertised window, just in case.

Also, the numbers you report are inconsistent. You mention that your
network is 10 MB/s and that native Linux achieves 14.9 MB/s. How is it
possible to achieve a throughput higher than the network bandwidth?
Could you please clarify?

Thanks

Renato

> -----Original Message-----
> From: Santos, Jose Renato G (Jose Renato Santos)
> Sent: Monday, April 04, 2005 7:07 PM
> To: 'xuehai zhang'; Xen-devel@lists.xensource.com
> Cc: Aravind Menon; Turner, Yoshio; G John Janakiraman
> Subject: RE: [Xen-devel] MPI benchmark performance gap
> between native linux and domU
>
> Hi,
>
> We had a similar network problem in the past. We were using a TCP
> benchmark instead of MPI, but I believe your problem is probably the
> same as the one we encountered. It took us a while to get to the
> bottom of this, and we only identified the reason for this behavior
> after we ported oprofile to Xen and did some performance profiling
> experiments.
>
> Here is a brief explanation of the problem we found and the solution
> that worked for us.
>
> Xenolinux allocates a full page (4KB) to store socket buffers instead
> of using just MTU bytes as in traditional Linux. This is necessary to
> enable page exchanges between the guest and the I/O domains. The side
> effect is that the memory space reserved for socket buffers is not
> used very efficiently. Even if packets have the maximum MTU size
> (typically 1500 bytes for Ethernet), the total buffer utilization is
> very low (at most just slightly higher than 35%).
> If packets arrive faster than they are processed at the receiver
> side, they will exhaust the receive buffer before the TCP advertised
> window is reached. (By default Linux uses a TCP advertised window
> equal to 75% of the receive buffer size. In standard Linux, this is
> typically sufficient to stop packet transmission at the sender before
> running out of receive buffers. The same is not true in Xen due to
> the inefficient use of socket buffers.) When a packet arrives and
> there is no receive buffer available, TCP tries to free socket buffer
> space by eliminating socket buffer fragmentation (i.e. eliminating
> wasted buffer space). This is done at the cost of an extra copy of
> all receive buffers to new compacted socket buffers. This introduces
> overhead and reduces throughput when the CPU is the bottleneck, which
> seems to be your case.
>
> This problem is not very frequent because modern CPUs are fast enough
> to receive packets at Gigabit speeds, so the receive buffer does not
> fill up. However, the problem may arise when using slower machines
> and/or when the workload consumes a lot of CPU cycles, as scientific
> MPI applications do. In your case you have both factors against you.
>
> The solution to this problem is trivial. You just have to change the
> TCP advertised window of your guest to a lower value. In our case, we
> used 25% of the receive buffer size and that was sufficient to
> eliminate the problem. This can be done using the following command:
>
>     echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>
> (The default 2 corresponds to 75% of the receive buffer, and -2
> corresponds to 25%.)
>
> Please let me know if this improves your results. You should still
> see a degradation in throughput when comparing Xen to traditional
> Linux, but hopefully you should be able to see better throughput.
> You should also try running your experiments in domain 0.
> This will give better throughput, although still lower than
> traditional Linux. I am curious to know if this has any effect in
> your experiments. Please post the new results if it does.
>
> Thanks
>
> Renato
>
> > -----Original Message-----
> > From: xen-devel-bounces@lists.xensource.com
> > [mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of
> > xuehai zhang
> > Sent: Monday, April 04, 2005 4:19 PM
> > To: Xen-devel@lists.xensource.com
> > Subject: [Xen-devel] MPI benchmark performance gap between
> > native linux and domU
> >
> > Hi all,
> >
> > I did the following experiments to explore MPI application
> > execution performance on both native Linux machines and inside
> > unprivileged Xen user domains. I use 8 machines with identical HW
> > configurations (498.756 MHz dual CPU, 512MB memory, on a 10MB/sec
> > LAN) and I use Pallas MPI Benchmarks (PMB).
> >
> > Experiment 1: I boot all 8 nodes with native Linux (nosmp, kernel
> > 2.4.29) and use all of them for PMB tests.
> >
> > Experiment 2: I boot all 8 nodes with Xen running and start a
> > single user domain (port 2.6.10, using file-backed VBD) on each
> > node with 360MB memory. Then I run the same PMB tests among these
> > 8 user domains.
> >
> > The experiment results show that running the same MPI benchmark in
> > user domains usually results in worse (sometimes much worse)
> > performance compared with native Linux machines. The following are
> > the results of the PMB SendRecv benchmark for both experiments
> > (Table 1 and Table 2 report throughput and latency respectively).
> > As you may notice, SendRecv can achieve 14.9 MB/sec throughput on
> > native Linux machines but at most 7.07 MB/sec when running inside
> > user domains. The latency results also show a big gap.
> >
> > Clearly, there is a difference between the memory used in the
> > native Linux machine of Experiment 1 (512MB) and in the user domain
> > of Experiment 2 (360MB; it cannot go higher because dom0 started
> > with 128MB memory). However, I don't think it is the main cause of
> > the performance gap because the tested message sizes are much
> > smaller than both memory sizes.
> >
> > I would appreciate your help if you have had a similar experience
> > and want to share your insights.
> >
> > BTW, if you are not familiar with the PMB SendRecv benchmark, you
> > can find a detailed explanation at
> > http://people.cs.uchicago.edu/~hai/PMB-MPI1.pdf (see section 4.3.1).
> >
> > Thanks in advance for your help.
> >
> > Xuehai
> >
> > P.S. Table 1: SendRecv throughput (MB/sec) performance
> >
> > Message_Size(bytes)  Experiment_1  Experiment_2
> > 0                    0             0
> > 1                    0             0
> > 2                    0             0
> > 4                    0             0
> > 8                    0.04          0.01
> > 16                   0.16          0.01
> > 32                   0.34          0.02
> > 64                   0.65          0.04
> > 128                  1.17          0.09
> > 256                  2.15          0.59
> > 512                  3.4           1.23
> > 1K                   5.29          2.57
> > 2K                   7.68          3.5
> > 4K                   10.7          4.96
> > 8K                   13.35         7.07
> > 16K                  14.9          3.77
> > 32K                  9.85          3.68
> > 64K                  5.06          3.02
> > 128K                 7.91          4.94
> > 256K                 7.85          5.25
> > 512K                 7.93          6.11
> > 1M                   7.85          6.5
> > 2M                   8.18          5.44
> > 4M                   7.55          4.93
> >
> > Table 2: SendRecv latency (millisec) performance
> >
> > Message_Size(bytes)  Experiment_1  Experiment_2
> > 0                    1979.6        3010.96
> > 1                    1724.16       3218.88
> > 2                    1669.65       3185.3
> > 4                    1637.26       3055.67
> > 8                    406.77        2966.17
> > 16                   185.76        2777.89
> > 32                   181.06        2791.06
> > 64                   189.12        2940.82
> > 128                  210.51        2716.3
> > 256                  227.36        843.94
> > 512                  287.28        796.71
> > 1K                   368.72        758.19
> > 2K                   508.65        1144.24
> > 4K                   730.59        1612.66
> > 8K                   1170.22       2471.65
> > 16K                  2096.86       8300.18
> > 32K                  6340.45       17017.99
> > 64K                  24640.78      41264.5
> > 128K                 31709.09      50608.97
> > 256K                 63680.67      94918.13
> > 512K                 125531.7      162168.47
> > 1M                   251566.94     321451.02
> > 2M                   477431.32     707981
> > 4M                   997768.35     1503987.61
> >
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xensource.com
> > http://lists.xensource.com/xen-devel
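
The buffer arithmetic in Renato's explanation can be sketched in a few lines of Python. This is only an illustrative model, not kernel code: the 4096-byte page and 1500-byte MTU come from the post, the window formula mirrors the two percentages quoted for tcp_adv_win_scale (2 gives 75%, -2 gives 25%), and the receive buffer size of 87380 bytes is an assumed value chosen for illustration. Real kernels also charge per-packet metadata against the buffer, which this sketch ignores.

```python
PAGE_SIZE = 4096  # bytes Xenolinux reserves per socket buffer (from the post)
MTU = 1500        # typical Ethernet MTU in bytes (from the post)


def advertised_window(space: int, adv_win_scale: int) -> int:
    """Simplified model of the window Linux advertises for a given buffer space.

    Positive scale: space - space / 2**scale (scale 2 -> 75%).
    Zero or negative scale: space / 2**(-scale) (scale -2 -> 25%).
    """
    if adv_win_scale <= 0:
        return space >> (-adv_win_scale)
    return space - (space >> adv_win_scale)


def payload_capacity(space: int, buffer_size: int = PAGE_SIZE,
                     payload: int = MTU) -> int:
    """Payload bytes that fit when every packet pins one whole buffer."""
    return (space // buffer_size) * payload


rcvbuf = 87380  # assumed receive buffer size in bytes, for illustration only

print(f"buffer utilization: {MTU / PAGE_SIZE:.1%}")
for scale in (2, -2):
    window = advertised_window(rcvbuf, scale)
    fits = window <= payload_capacity(rcvbuf)
    print(f"scale={scale:+d} window={window} fits_in_buffer={fits}")
```

Under these assumptions the 75% window (65535 bytes) is larger than the 31500 payload bytes the page-sized buffers can actually hold, so the sender may keep transmitting after the receiver has run out of buffers. The 25% window (21845 bytes) stays below that limit.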
xuehai zhang
2005-Apr-05 05:27 UTC
Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
Jose,

Thank you so much for your valuable information.

> I guess I overlooked the rates you reported in your post.
> Looking now at your rates carefully I got somewhat confused. When you
> say MB/sec do you mean Megabyte/sec or Megabit/sec?

It is Megabyte/sec (2^20 bytes).

> In any case these
> are much lower rates than in our case (we were using a gigabit network).
> Now, I am starting to think that your problem might be different than
> ours, but it does not hurt to try changing the advertised window, just
> in case.

I will try your suggestion and send out an update tomorrow morning.

> Also, the numbers you report are inconsistent. You mention that your
> network is 10 MB/s, and that native linux achieves 14.9 MB/s. How is it
> possible to achieve a throughput higher than the network bandwidth?
> Could you please clarify?

Yes, it is a little confusing. It is due to the calculation of
SendRecv's throughput. If you take a look at the PMB user manual
(following the link in my previous email), the throughput is defined as
2X/1.048576/time, where X is the message size in bytes and time is the
latency. So it is a weighted throughput and can go beyond 10MB/s, which
is the max bandwidth of the network.

Thanks.

Xuehai
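
As a quick check of the formula Xuehai cites, the sketch below recomputes two Experiment 1 rows of Table 1 from the matching latencies in Table 2. The function name is made up for illustration. The latencies are taken as microseconds here, since that is the unit at which the formula reproduces these rows, even though the table header in the original post says millisec.

```python
def sendrecv_throughput(message_bytes: int, time_usec: float) -> float:
    """PMB SendRecv throughput in MB/sec, with 1 MB = 2**20 bytes.

    The factor 2 counts the bytes sent plus the bytes received by each
    process, which is why the result can exceed the one-way link rate.
    """
    return 2 * message_bytes / 1.048576 / time_usec


# (message size in bytes, Experiment 1 latency from Table 2)
rows = ((8 * 1024, 1170.22), (16 * 1024, 2096.86))
for size, latency in rows:
    print(size, round(sendrecv_throughput(size, latency), 2))
```

The two results round to 13.35 and 14.9 MB/sec, matching the 8K and 16K entries for native Linux in Table 1. Some other rows differ slightly from the formula.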