Santos, Jose Renato G (Jose Renato Santos)
2005-Apr-06 00:17 UTC
RE: [Xen-devel] MPI benchmark performance gap between native linux and domU
Nivedita, Bin, Andrew, and all interested in Xenoprof,

We should be posting the xenoprof patches in a few days. We are doing some
last cleaning up of the code. Just be a little more patient.

Thanks,
Renato

>> -----Original Message-----
>> From: Nivedita Singhvi [mailto:niv@us.ibm.com]
>> Sent: Tuesday, April 05, 2005 3:23 PM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: xuehai zhang; Xen-devel@lists.xensource.com; Turner, Yoshio; Aravind Menon; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
>>
>> Santos, Jose Renato G (Jose Renato Santos) wrote:
>>
>> > Hi,
>> >
>> > We had a similar network problem in the past. We were using a TCP
>> > benchmark instead of MPI, but I believe your problem is probably the
>> > same as the one we encountered.
>> > It took us a while to get to the bottom of this, and we only
>> > identified the reason for this behavior after we ported oprofile to
>> > Xen and did some performance profiling experiments.
>>
>> Hello! Was this on the 2.6 kernel? Would you be able to
>> share the oprofile port? It would be very handy indeed
>> right now. (I was told by a few people that someone
>> was porting oprofile, and I believe there was some status
>> on the list that went by) but haven't seen it yet...
>>
>> > Here is a brief explanation of the problem we found and the solution
>> > that worked for us.
>> > Xenolinux allocates a full page (4KB) to store each socket buffer
>> > instead of using just MTU bytes as in traditional Linux. This is
>> > necessary to enable page exchanges between the guest and the I/O
>> > domains. The side effect is that the memory space used for socket
>> > buffers is not used very efficiently. Even if packets have the
>> > maximum MTU size (typically 1500 bytes for Ethernet), the total
>> > buffer utilization is very low (at most just slightly higher than
>> > 35%). If packets arrive faster than they are processed at the
>> > receiver side, they will exhaust the receive buffer
>>
>> Most small connections (say up to 3-4K) involve only 3 to 5
>> segments, and so the TCP window never really opens fully.
>> On longer-lived connections, it does help very much to have
>> a large buffer.
>>
>> > before the TCP advertised window is reached. (By default Linux uses a
>> > TCP advertised window equal to 75% of the receive buffer size. In
>> > standard Linux, this is typically sufficient to stop packet
>> > transmission at the sender before running out of receive buffers. The
>> > same is not true in Xen due to the inefficient use of socket buffers.)
>> > When a packet arrives and there is no receive buffer available, TCP
>> > tries to free socket buffer space by eliminating socket buffer
>> > fragmentation (i.e. eliminating wasted buffer space). This is done at
>> > the cost of an extra copy of the entire receive buffer into new,
>> > compacted socket buffers. This introduces overhead and reduces
>> > throughput when the CPU is the bottleneck, which seems to be your case.
>>
>> /proc/net/netstat will show a counter of just how many times
>> this happens (RcvPruned). Would be interesting if that was
>> significant.
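(For reference, the "slightly higher than 35%" above is roughly a full
1500-byte MTU frame sitting in a 4096-byte page, i.e. about 37%.)

A quick way to watch that counter, assuming the usual 2.6-style
/proc/net/netstat layout (counter names may differ slightly between kernel
versions), is to pair up the TcpExt header and value lines and check the
prune counters before and after a run:

  # print the TcpExt counters as "name value" pairs and pick out the
  # prune-related ones (PruneCalled, RcvPruned, OfoPruned)
  awk '/^TcpExt:/ && !h { split($0, hdr); h = 1; next }
       /^TcpExt:/       { for (i = 2; i <= NF; i++) print hdr[i], $i }' \
      /proc/net/netstat | grep -i prune

If RcvPruned grows noticeably during the benchmark, the receive buffers are
being collapsed as described above.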
>> > This problem is not very frequent because modern CPUs are fast
>> > enough to receive packets at Gigabit speeds and the receive buffer
>> > does not fill up. However, the problem may arise when using slower
>> > machines and/or when the workload consumes a lot of CPU cycles, as
>> > scientific MPI applications do. In your case you have both factors
>> > against you.
>> >
>> > The solution to this problem is trivial. You just have to change the
>> > TCP advertised window of your guest to a lower value. In our case, we
>> > used 25% of the receive buffer size, and that was sufficient to
>> > eliminate the problem. This can be done using the following command:
>> >
>> >   echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> How much did this improve your results by? And wouldn't
>> making the default socket buffers, max socket buffers
>> larger by, say, 5 times be more effective (other than for
>> those applications using setsockopt() to set their buffers
>> to some size already, but not large enough)?
>>
>> > (The default 2 corresponds to 75% of the receive buffer, and -2
>> > corresponds to 25%.)
>> >
>> > Please let me know if this improves your results. You should still
>> > see a degradation in throughput when comparing Xen to traditional
>> > Linux, but hopefully you should be able to see better throughputs.
>> > You should also try running your experiments in domain 0. This will
>> > give better throughput, although still lower than traditional Linux.
>> > I am curious to know whether this has any effect in your experiments.
>> > Please post the new results if it does.
>>
>> Yep, me too..
>>
>> thanks,
>> Nivedita
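P.S. For anyone else who wants to try this, here is a rough sketch of what
the knob does and how to set it. The percentages above follow from how the
kernel interprets tcp_adv_win_scale; only the echo line is taken verbatim
from the thread, the rest is the usual sysctl equivalent:

  # tcp_adv_win_scale controls how much of the socket receive buffer is
  # advertised as TCP window:
  #   scale > 0:  window = buffer - buffer/2^scale    (default 2 -> 75%)
  #   scale < 0:  window = buffer/2^(-scale)          (-2 -> 25%)

  # apply it immediately in the guest ...
  echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
  # ... or, equivalently, through sysctl
  sysctl -w net.ipv4.tcp_adv_win_scale=-2

  # to keep the setting across reboots, add this line to /etc/sysctl.conf:
  #   net.ipv4.tcp_adv_win_scale = -2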