Hey everyone,

A few weeks back there was some discussion on Xen's I/O performance in
clusters on the list. I did some experiments today myself using iperf
(not ttcp):

 o Xen dom0 talking to another machine in the cluster running native
   Linux: b/w around 904 Mbps -- that's nice :)

 o A Xen VM (running on the same machine as the dom0 in the previous
   experiment) talking to another machine running native Linux (again,
   the same as in the previous experiment) only achieves 128 Mbps.

I read on the list that you folks at Cambridge got up to 800+ Mbps
across VMs? Did you do any special optimizations or set any special
parameters? I read something about socket buffer size?

Thanks,
--
Diwaker Gupta
http://resolute.ucsd.edu/diwaker

-------------------------------------------------------
This SF.Net email is sponsored by: InterSystems CACHE
FREE OODBMS DOWNLOAD - A multidimensional database that combines
robust object and relational technologies, making it a perfect match
for Java, C++, COM, XML, ODBC and JDBC. www.intersystems.com/match8
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xen-devel
> Hey everyone,
>
> A few weeks back there was some discussion on Xen's I/O performance in
> clusters on the list. I did some experiments today myself using iperf
> (not ttcp):
>
>  o Xen dom0 talking to another machine in the cluster running native
>    Linux: b/w around 904 Mbps -- that's nice :)
>
>  o A Xen VM (running on the same machine as the dom0 in the previous
>    experiment) talking to another machine running native Linux (again,
>    the same as in the previous experiment) only achieves 128 Mbps.
>
> I read on the list that you folks at Cambridge got up to 800+ Mbps
> across VMs? Did you do any special optimizations or set any special
> parameters? I read something about socket buffer size?

We did our measurements with a 128KB socket buffer on a 3.0GHz dual
Xeon. It does make a difference whether dom0 and domU are sharing the
same CPU, are on different CPUs, or are on different hyperthreads of
the same package. At least in our experiments, with an MTU of 1500
bytes things were pretty good regardless of the CPU allocation, but
with an artificially reduced MTU of 552 bytes we were definitely
seeing the advantage of having two CPUs.

There have been a couple of reports on the list of people seeing
unexpectedly low numbers, so something odd must be going on on some
systems. Please can you give more information about your setup?

When doing the experiments it would be interesting to know the number
of interrupts per second reported by dom0 and domU in each
configuration. I guess the best approach to solving this is probably
to add more instrumentation to the netfront/netback drivers and export
the data via a /proc interface.

It's possible that we're getting into a situation whereby the
pipelining is breaking down and we're only transferring a couple of
packets per context switch.
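For reference, iperf's -w flag is what sets the socket buffer size, and a
quick bandwidth-delay-product check shows why a buffer of roughly 128KB
matters at gigabit speeds. This is a back-of-envelope sketch, not a figure
from the thread: the 1 ms round-trip time is an assumed LAN-ish value.

```shell
# iperf -c <receiver> -w 128K -t 30    # -w sets the socket buffer size
# The TCP window needed to keep a link busy is bandwidth * RTT:
rate_bps=1000000000    # 1 Gbit/s link
rtt_s=0.001            # assumed 1 ms round-trip time (hypothetical)
awk -v r="$rate_bps" -v t="$rtt_s" \
    'BEGIN { printf "window needed: %.0f KB\n", r * t / 8 / 1024 }'
# -> window needed: 122 KB
```

So a 128KB buffer just covers gigabit at a 1 ms RTT; a smaller default
buffer would cap the achievable bandwidth well below wire speed.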
Ian

Here's the actual data we recorded (iperf bandwidth in Mbps):

               MTU 1500        MTU 552
               TX     RX       TX     RX
 Linux SMP     897    897      808    808
 dom0          897    898      718    769
 domU UP       897    843      436    379
 domU HT       897    897      651    577
 domU SMP      897    897      778    663
 (VMware)      291    651      101    137
 (UML)         165    203       61     91
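As a sanity check on the table (not from the original mail), the theoretical
TCP goodput ceilings on gigabit Ethernet can be worked out from the
per-packet overheads: the 897 Mbps rows are essentially wire speed, and the
MTU-552 native-Linux figure of 808 Mbps sits just under its ceiling.

```shell
# Theoretical TCP goodput ceiling on gigabit Ethernet: payload bytes over
# on-wire bytes per packet. Assumes 52 bytes of TCP/IP header (timestamps
# enabled) and 38 bytes of Ethernet framing, preamble and inter-frame gap.
awk 'function ceiling(mtu) { return 1000 * (mtu - 52) / (mtu + 38) }
     BEGIN {
         printf "MTU 1500: %.0f Mbps\n", ceiling(1500)
         printf "MTU  552: %.0f Mbps\n", ceiling(552)
     }'
# -> MTU 1500: 941 Mbps
# -> MTU  552: 847 Mbps
```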
> > Hey everyone,
> >
> > A few weeks back there was some discussion on Xen's I/O performance in
> > clusters on the list. I did some experiments today myself using iperf
> > (not ttcp):
> >
> >  o Xen dom0 talking to another machine in the cluster running native
> >    Linux: b/w around 904 Mbps -- that's nice :)
> >
> >  o A Xen VM (running on the same machine as the dom0 in the previous
> >    experiment) talking to another machine running native Linux (again,
> >    the same as in the previous experiment) only achieves 128 Mbps.
> >
> > I read on the list that you folks at Cambridge got up to 800+ Mbps
> > across VMs? Did you do any special optimizations or set any special
> > parameters? I read something about socket buffer size?

One thing you might want to try is to change a line in the file
linux-2.6.9-xenU/drivers/xen/netfront/netfront.c.

From:
    #define RX_MIN_TARGET 8
To:
    #define RX_MIN_TARGET NETIF_RX_RING_SIZE

One possibility is that dynamic buffer sizing is dropping some packets
and causing TCP to crap itself.

If this improves things then I'll have to be much more careful about
shrinking the buffers, and/or add a config option to disable the
resizing completely.

 -- Keir
I changed RX_MIN_TARGET in
linux-2.4.27-xenU/arch/xen/drivers/netif/frontend/main.c and it made
no difference at all in the 1500-byte iperf test.

" the number of interrupts per second reported by the dom0 and domU
" in each configuration.

For an iperf TCP stream sent from stock Linux to xenU
(2.4.27-xenU + RX_MIN_TARGET mod):

1500 MTU: 464 Mbps

interrupts per second seen on xen0:
 irq  1:      0  Phys-irq     keyboard    irq129:      0  Dynamic-irq  ctrl-if
 irq 14:      0  Phys-irq     ide0        irq130:   7473  Dynamic-irq  timer
 irq 17:  28526  Phys-irq     eth0        irq131:      0  Dynamic-irq  timer_d
 irq 18:     15  Phys-irq     aic7xxx     irq132:      0  Dynamic-irq  console
 irq 19:      0  Phys-irq     aic7xxx     irq133:      0  Dynamic-irq  blkif-b
 irq128:      0  Dynamic-irq  misdire     irq134:   1464  Dynamic-irq  vif3.0

interrupts per second seen on xenU:
 irq128:      0  Dynamic-irq  misdire     irq131:      0  Dynamic-irq  timer_d
 irq129:      0  Dynamic-irq  ctrl-if     irq132:      0  Dynamic-irq  blkif
 irq130:   5273  Dynamic-irq  timer       irq133:   5121  Dynamic-irq  eth0

552 MTU: 230 Mbps

interrupts per second seen on xen0:
 irq  1:      0  Phys-irq     keyboard    irq129:      0  Dynamic-irq  ctrl-if
 irq 14:      0  Phys-irq     ide0        irq130:   9103  Dynamic-irq  timer
 irq 17:  19227  Phys-irq     eth0        irq131:      0  Dynamic-irq  timer_d
 irq 18:     10  Phys-irq     aic7xxx     irq132:      0  Dynamic-irq  console
 irq 19:      0  Phys-irq     aic7xxx     irq133:      0  Dynamic-irq  blkif-b
 irq128:      0  Dynamic-irq  misdire     irq134:   1804  Dynamic-irq  vif3.0

interrupts per second seen on xenU:
 irq128:      0  Dynamic-irq  misdire     irq131:      0  Dynamic-irq  timer_d
 irq129:      0  Dynamic-irq  ctrl-if     irq132:      0  Dynamic-irq  blkif
 irq130:   7264  Dynamic-irq  timer       irq133:   7158  Dynamic-irq  eth0

The e1000 driver is stock, so these are with the default interrupt
coalescing settings.
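Per-second figures like the ones above can be reproduced by sampling
/proc/interrupts twice, one second apart, and diffing the counts. A rough
sketch (the field handling assumes the usual "IRQ: count ..." layout of
/proc/interrupts; this is not a script from the thread):

```shell
# Sample /proc/interrupts twice, one second apart, and print the delta
# (i.e. interrupts per second) for every IRQ whose count moved.
snap1=$(mktemp); snap2=$(mktemp)
cat /proc/interrupts > "$snap1"
sleep 1
cat /proc/interrupts > "$snap2"
# First pass (NR == FNR) records the old counts; second pass prints deltas.
awk 'NR == FNR { if ($2 ~ /^[0-9]+$/) before[$1] = $2; next }
     $2 ~ /^[0-9]+$/ && $2 > before[$1] {
         printf "%-8s %6d/s\n", $1, $2 - before[$1]
     }' "$snap1" "$snap2"
rm -f "$snap1" "$snap2"
```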
> One thing you might want to try is to change a line in the file
> linux-2.6.9-xenU/drivers/xen/netfront/netfront.c.
>
> From:
>     #define RX_MIN_TARGET 8
> To:
>     #define RX_MIN_TARGET NETIF_RX_RING_SIZE
>
> One possibility is that dynamic buffer sizing is dropping some packets
> and causing TCP to crap itself.
>
> If this improves things then I'll have to be much more careful about
> shrinking the buffers, and/or add a config option to disable the
> resizing completely.
>
> -- Keir

My bad. Sorry about the false alarm everyone -- it was a routing
issue. With the correct routing setup, I can see up to 930 Mbps
between VMs running on 2 distinct physical machines in the cluster.

Cheers!
--
Diwaker Gupta
http://resolute.ucsd.edu/diwaker
> Sorry about the false alarm everyone, it was a routing issue. With the
> correct routing setup, I can see up to 930 Mbps between VMs running on
> 2 distinct physical machines in the cluster.

Phew!

Is anyone else still seeing network performance anomalies? I know that
some people were seeing odd _dom0_ network performance, but I suspect
that's an IOAPIC issue that will go away when Xen's boot code gets
restructured over the next few weeks (see the new Roadmap web page).

Ian
" Is anyone else still seeing network performance anomalies? I still cannot get over 500Mbps into xenU with any hardware that I have. This holds using kernels I build, or using kernel binaries from the 2.0.1 tarball. I tried tuning a e1000 driver, which greatly reduced the interrupt count, but had no effect on bandwidth. I can get 600 to 750 Mbps into domain-0 from a stock linux host. That rate then drops to around 500 after starting the etherbridge. Running top on Domain-0 claims the domain is over 60% idle. This is with e1000 and bcm5703 NICs on IBM x335, Dell 1650 and other platforms, and a variety of CPU clock speeds. Running iperf under stock linux 2.4.25 gets 940Mbps between any of them. ------------------------------------------------------- SF email is sponsored by - The IT Product Guide Read honest & candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://productguide.itmanagersjournal.com/ _______________________________________________ Xen-devel mailing list Xen-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/xen-devel
> > " Is anyone else still seeing network performance anomalies? > > I still cannot get over 500Mbps into xenU with any hardware that I have. > This holds using kernels I build, or using kernel binaries from > the 2.0.1 tarball. > > I tried tuning a e1000 driver, which greatly reduced the interrupt count, > but had no effect on bandwidth. > > I can get 600 to 750 Mbps into domain-0 from a stock linux host. > That rate then drops to around 500 after starting the etherbridge. > Running top on Domain-0 claims the domain is over 60% idle.With the domain 0 otherwise idle, what happens if you run ''slurp'' (attached). The only time I''ve ever seen the bridge burn CPU is if you try setting some of its delay parameters to zero in which case it can cause it to periodically loop.> This is with e1000 and bcm5703 NICs on IBM x335, Dell 1650 and > other platforms, and a variety of CPU clock speeds. > Running iperf under stock linux 2.4.25 gets 940Mbps between any of them.What''s the spec of the most modern machines you''ve tried Xen on? Ian /****************************************************************************** * slurp.c * * Slurps spare CPU cycles and prints a percentage estimate every second. */ #include <stdio.h> #include <stdlib.h> #include <string.h> #include <err.h> /* rpcc: get full 64-bit Pentium TSC value */ static __inline__ unsigned long long int rpcc(void) { unsigned int __h, __l; __asm__ __volatile__ ("rdtsc" :"=a" (__l), "=d" (__h)); return (((unsigned long long)__h) << 32) + __l; } /* * find_cpu_speed: * Interrogates /proc/cpuinfo for the processor clock speed. * * Returns: speed of processor in MHz, rounded down to nearest whole MHz. */ #define MAX_LINE_LEN 50 int find_cpu_speed(void) { FILE *f; char s[MAX_LINE_LEN], *a, *b; if ( (f = fopen("/proc/cpuinfo", "r")) == NULL ) goto out; while ( fgets(s, MAX_LINE_LEN, f) ) { if ( strstr(s, "cpu MHz") ) { /* Find the start of the speed value, and stop at the dec point. 
*/ if ( !(a=strpbrk(s,"0123456789")) || !(b=strpbrk(a,".")) ) break; *b = ''\0''; fclose(f); return(atoi(a)); } } out: fprintf(stderr, "find_cpu_speed: error parsing /proc/cpuinfo for cpu MHz"); exit(1); } int main( int argc, char **argv ) { int mhz, i, cpu=-1; /* * no_preempt_estimate is our estimate, in clock cycles, of how long it * takes to execute one iteration of the main loop when we aren''t * preempted. 50000 cycles is an overestimate, which we want because: * (a) On the first pass through the loop, diff will be almost 0, * which will knock the estimate down to <40000 immediately. * (b) It''s safer to approach real value from above than from below -- * note that this algorithm is unstable if n_p_e gets too small! */ unsigned int no_preempt_estimate = 50000; /* * prev = timestamp on previous iteration; * this = timestamp on this iteration; * diff = difference between the above two stamps; * start = timestamp when we last printed CPU % estimate; */ unsigned long long int prev, this, diff, start; /* * preempt_time = approx. cycles we''ve been preempted for since last stats * display. */ unsigned long long int preempt_time = 0; if ( argc > 1 ) cpu = atoi(argv[1]); else if ( argc > 2 ) exit(-1); /* Required in order to print intermediate results at fixed period. */ mhz = find_cpu_speed(); printf("CPU speed = %d MHz, using cpu %d\n", mhz, cpu); if (cpu>=0) { int rc; unsigned long bs = 0; bs = 1<<cpu; rc=sched_setaffinity( getpid(), sizeof(bs)*8, &bs ); if(rc) err(rc,"sched_getaffinity failed\n."); } start = prev = rpcc(); for ( ; ; ) { /* * By looping for a while here we hope to reduce affect of getting * preempted in critical "timestamp swapping" section of the loop. * In addition, it should ensure that ''no_preempt_estimate'' stays * reasonably large which helps keep this algorithm stable. */ for ( i = 0; i < 10000; i++ ); /* * The critical bit! Getting preempted here will shaft us a bit, * but the loop above should make this a rare occurrence. 
*/ this = rpcc(); diff = this - prev; prev = this; /* if ( diff > (1.5 * preempt_estimate) */ if ( diff > no_preempt_estimate + (no_preempt_estimate>>1) ) { /* We were probably preempted for a while. */ preempt_time += diff - no_preempt_estimate; } else { /* * Looks like we weren''t preempted -- update our time estimate: * New estimate = 0.75*old_est + 0.25*curr_diff */ no_preempt_estimate (no_preempt_estimate>>1) + (no_preempt_estimate>>2) + (diff>>2); } /* Dump CPU time every second. */ if ( (this - start) / mhz > 1000000 ) { printf("Slurped %.2f%% CPU, cpu %d\n", 100.0*((this-start-preempt_time)/((double)this-start)), cpu); start = this; preempt_time = 0; } } return(0); } ------------------------------------------------------- SF email is sponsored by - The IT Product Guide Read honest & candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://productguide.itmanagersjournal.com/ _______________________________________________ Xen-devel mailing list Xen-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/xen-devel
" With the domain 0 otherwise idle, what happens if you run ''slurp'' " (attached). slurp output on domain 0. xen0 is hit with a 10 second iperf run in the middle. (xen0 is idle, there are no xenU hosted, but etherbridge and xend are running) Slurped 99.32% CPU, cpu -1 Slurped 99.32% CPU, cpu -1 Slurped 96.52% CPU, cpu -1 Slurped 31.97% CPU, cpu -1 Slurped 54.25% CPU, cpu -1 Slurped 42.67% CPU, cpu -1 Slurped 42.56% CPU, cpu -1 Slurped 49.14% CPU, cpu -1 Slurped 32.32% CPU, cpu -1 Slurped 37.31% CPU, cpu -1 Slurped 30.14% CPU, cpu -1 Slurped 52.59% CPU, cpu -1 Slurped 40.65% CPU, cpu -1 Slurped 99.22% CPU, cpu -1 Slurped 99.32% CPU, cpu -1 Slurped 99.29% CPU, cpu -1 " What''s the spec of the most modern machines you''ve tried Xen on? dual 2.8GHz P4xeon 2GB ram. Sending iperf from another box on that switch, I see 380Mbps into a xenU on a 2.8GHz host. Of course the slower CPUs are more available so most of my tests run there (2.0GHz P4 and 1.4GHz P3). Is the network overhead so high that it matters? They all get wire speed with stock linux. Another odd thing I saw is that, while on stock linux the ''timer'' irq is rock steady at 100 interrupts per second, under xen0 the timer irq varies from 60 to 200 when idle, and hits 5000 per second xen0 is receiving an iperf stream. ------------------------------------------------------- SF email is sponsored by - The IT Product Guide Read honest & candid reviews on hundreds of IT Products from real users. Discover which products truly live up to the hype. Start reading now. http://productguide.itmanagersjournal.com/ _______________________________________________ Xen-devel mailing list Xen-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/xen-devel
> Another odd thing I saw is that, while on stock Linux the 'timer' irq
> is rock steady at 100 interrupts per second, under xen0 the timer irq
> varies from 60 to 200 when idle, and hits 5000 per second while xen0
> is receiving an iperf stream.

Domains don't get tick interrupts when they aren't running, so you can
see tick rates lower than native. Also, a domain gets a tick interrupt
every time it is rescheduled, which can happen at an arbitrarily fast
rate -- which explains your very high tick rates in some cases.

 -- Keir
I looked closely at tcpdumps of an iperf stream flowing into domain-0
from a stock Linux box. It looks like my xen0 is not able to send out
ACK packets while receiving incoming data packets.

That is based on the first attached postscript graph. It was generated
by tcptrace/xplot from a tcpdump taken on xen0 while iperf data for
xen0 is arriving. The green line plots the time of the highest seen
ACK seq number. The yellow line plots the seq number of the window
limit over time. The black segments show the time of incoming data
(diamonds show the TCP PUSH flag).

The second postscript graph traces an iperf stream as seen from the
sender side. The sender runs stock Linux 2.4.25; the receiver runs
2.4.27-xen0. It shows that the sender fills the window as soon as
green acks arrive, then waits quite a while for the next batch of acks
before resuming transmission to xen0. Increasing the window size makes
no difference, as the sender keeps the window full (graph-wise that
means the yellow and green lines are farther apart, but the black
segments reach the yellow line as soon as the acks arrive).

Looking at iperf flows between stock linux boxes shows the sender
never fills the window (the black segments rarely reach the yellow
line).

I'm guessing this means the handling of Rx interrupts completely
blocks out progress on the Tx side. I tried throwing in a noapic
kernel param but it made no difference.
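The ACK stalls described above can also be spotted without graphs. A rough
sketch, not what produced the postscript plots (those came from
tcptrace/xplot): the file name "sender.pcap" and the 50 ms threshold are
made up, and this assumes tcpdump's -tt epoch-timestamp output where
ACK-bearing packets contain the word "ack".

```shell
# Flag gaps of more than 50ms between successive ACK-bearing packets in a
# sender-side trace; long gaps match the "waits quite a while for the
# next batch of acks" behaviour seen in the graphs.
tcpdump -tt -r sender.pcap tcp 2>/dev/null |
awk -v gap=0.05 '/ack/ {
    if (prev != "" && $1 - prev > gap)
        printf "ack stall: %.3fs ending at t=%s\n", $1 - prev, $1
    prev = $1
}'
```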