Santos, Jose Renato G (Jose Renato Santos)
2005-Apr-06 00:17 UTC
RE: [Xen-devel] MPI benchmark performance gap between native linux and domU
Nivedita, Bin, Andrew, and all interested in Xenoprof,

We should be posting the xenoprof patches in a few days. We are doing some
last cleaning up of the code. Just be a little more patient.

Thanks,
Renato

>> -----Original Message-----
>> From: Nivedita Singhvi [mailto:niv@us.ibm.com]
>> Sent: Tuesday, April 05, 2005 3:23 PM
>> To: Santos, Jose Renato G (Jose Renato Santos)
>> Cc: xuehai zhang; Xen-devel@lists.xensource.com; Turner, Yoshio; Aravind Menon; G John Janakiraman
>> Subject: Re: [Xen-devel] MPI benchmark performance gap between native linux and domU
>>
>> Santos, Jose Renato G (Jose Renato Santos) wrote:
>>
>> > Hi,
>> >
>> > We had a similar network problem in the past. We were using a TCP
>> > benchmark instead of MPI, but I believe your problem is probably the
>> > same as the one we encountered.
>> > It took us a while to get to the bottom of this, and we only
>> > identified the reason for this behavior after we ported oprofile to
>> > Xen and did some performance profiling experiments.
>>
>> Hello! Was this on the 2.6 kernel? Would you be able to
>> share the oprofile port? It would be very handy indeed
>> right now. (I was told by a few people that someone
>> was porting oprofile, and I believe there was some status
>> on the list that went by) but haven't seen it yet...
>>
>> > Here is a brief explanation of the problem we found and the solution
>> > that worked for us.
>> > Xenolinux allocates a full page (4KB) to store each socket buffer
>> > instead of using just MTU bytes as in traditional Linux. This is
>> > necessary to enable page exchanges between the guest and the I/O
>> > domains. The side effect is that the memory space used for socket
>> > buffers is not used very efficiently. Even if packets have the
>> > maximum MTU size (typically 1500 bytes for Ethernet), the total
>> > buffer utilization is very low (at most just slightly higher than
>> > 35%). If packets arrive faster than they are processed at the
>> > receiver side, they will exhaust the receive buffer
>>
>> Most small connections (say up to 3-4K) involve only 3 to 5
>> segments, and so the TCP window never really opens fully.
>> On longer-lived connections, it does help very much to have
>> a large buffer.
>>
>> > before the TCP advertised window is reached. (By default Linux uses a
>> > TCP advertised window equal to 75% of the receive buffer size. In
>> > standard Linux, this is typically sufficient to stop packet
>> > transmission at the sender before running out of receive buffers. The
>> > same is not true in Xen due to the inefficient use of socket buffers.)
>> > When a packet arrives and there is no receive buffer available, TCP
>> > tries to free socket buffer space by eliminating socket buffer
>> > fragmentation (i.e. eliminating wasted buffer space). This is done at
>> > the cost of an extra copy of the entire receive buffer into new,
>> > compacted socket buffers. This introduces overhead and reduces
>> > throughput when the CPU is the bottleneck, which seems to be your case.
>>
>> /proc/net/netstat will show a counter of just how many times
>> this happens (RcvPruned). Would be interesting if that was
>> significant.
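(For reference, the "slightly higher than 35%" above is roughly a full
1500-byte MTU frame sitting in a 4096-byte page, i.e. about 37%.)

A quick way to watch that counter, assuming the usual 2.6-style
/proc/net/netstat layout (counter names may differ slightly between kernel
versions), is to pair up the TcpExt header and value lines and check the
prune counters before and after a run:

  # print the TcpExt counters as "name value" pairs and pick out the
  # prune-related ones (PruneCalled, RcvPruned, OfoPruned)
  awk '/^TcpExt:/ && !h { split($0, hdr); h = 1; next }
       /^TcpExt:/       { for (i = 2; i <= NF; i++) print hdr[i], $i }' \
      /proc/net/netstat | grep -i prune

If RcvPruned grows noticeably during the benchmark, the receive buffers are
being collapsed as described above.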
>> > This problem is not very frequent because modern CPUs are fast
>> > enough to receive packets at Gigabit speeds and the receive buffer
>> > does not fill up. However, the problem may arise when using slower
>> > machines and/or when the workload consumes a lot of CPU cycles, as
>> > scientific MPI applications do. In your case you have both factors
>> > against you.
>> >
>> > The solution to this problem is trivial. You just have to change the
>> > TCP advertised window of your guest to a lower value. In our case, we
>> > used 25% of the receive buffer size, and that was sufficient to
>> > eliminate the problem. This can be done using the following command:
>> >
>> >   echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
>>
>> How much did this improve your results by? And wouldn't
>> making the default socket buffers, max socket buffers
>> larger by, say, 5 times be more effective (other than for
>> those applications using setsockopt() to set their buffers
>> to some size already, but not large enough)?
>>
>> > (The default 2 corresponds to 75% of the receive buffer, and -2
>> > corresponds to 25%.)
>> >
>> > Please let me know if this improves your results. You should still
>> > see a degradation in throughput when comparing Xen to traditional
>> > Linux, but hopefully you should be able to see better throughputs.
>> > You should also try running your experiments in domain 0. This will
>> > give better throughput, although still lower than traditional Linux.
>> > I am curious to know whether this has any effect in your experiments.
>> > Please post the new results if it does.
>>
>> Yep, me too..
>>
>> thanks,
>> Nivedita
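P.S. For anyone else who wants to try this, here is a rough sketch of what
the knob does and how to set it. The percentages above follow from how the
kernel interprets tcp_adv_win_scale; only the echo line is taken verbatim
from the thread, the rest is the usual sysctl equivalent:

  # tcp_adv_win_scale controls how much of the socket receive buffer is
  # advertised as TCP window:
  #   scale > 0:  window = buffer - buffer/2^scale    (default 2 -> 75%)
  #   scale < 0:  window = buffer/2^(-scale)          (-2 -> 25%)

  # apply it immediately in the guest ...
  echo -2 > /proc/sys/net/ipv4/tcp_adv_win_scale
  # ... or, equivalently, through sysctl
  sysctl -w net.ipv4.tcp_adv_win_scale=-2

  # to keep the setting across reboots, add this line to /etc/sysctl.conf:
  #   net.ipv4.tcp_adv_win_scale = -2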