Hi folks,

I have a few questions/clarifications about dom0 network I/O scalability. I would appreciate any feedback/pointers.

As I understand it, the current implementation of netback does not scale beyond a single CPU core because it uses tasklets, making it a bottleneck (please correct me if I am wrong). I remember coming across some patches which attempt to use softirqs instead of tasklets to solve this issue, but the latest version of the linux-2.6-xen.hg repo does not include them. Are they included in some other version of dom0 Linux? Or will they be included in the future? This seems like a very important problem to solve, especially as the number of VMs and CPU cores goes up. Is there any fundamental limitation preventing a solution?

Thanks.
--Kaushik
Konrad Rzeszutek Wilk
2011-Apr-27 17:50 UTC
Re: [Xen-devel] Xen dom0 network I/O scalability
On Wed, Apr 27, 2011 at 12:30:27PM -0500, Kaushik Kumar Ram wrote:
> As I understand it, the current implementation of netback does not scale
> beyond a single CPU core because it uses tasklets, making it a bottleneck
> (please correct me if I am wrong). I remember coming across some patches
> which attempt to use softirqs instead of tasklets to solve this issue, but
> the latest version of the linux-2.6-xen.hg repo does not include them. Are
> they included in some other version of dom0 Linux? Or will they be
> included in the future?

You should be using the 2.6.39 kernel or the 2.6.32 to take advantage of
those patches.
On Apr 27, 2011, at 12:50 PM, Konrad Rzeszutek Wilk wrote:
>> As I understand it, the current implementation of netback does not scale
>> beyond a single CPU core because it uses tasklets, making it a bottleneck.
>> I remember coming across some patches which attempt to use softirqs
>> instead of tasklets to solve this issue, but the latest version of the
>> linux-2.6-xen.hg repo does not include them. Are they included in some
>> other version of dom0 Linux? Or will they be included in the future?
>
> You should be using the 2.6.39 kernel or the 2.6.32 to take advantage of
> those patches.

Thanks Konrad. I got hold of a pvops dom0 kernel from Jeremy's git repo
(xen/stable-2.6.32.x). As you pointed out, it does include those patches. I
spent some time studying the new netback design and ran some experiments,
and I have a few questions about them.

I am using the latest version of the hypervisor from the xen-unstable.hg
repo. I ran the experiments on a dual-socket AMD quad-core Opteron machine
(8 CPU cores in total). The experiments simply involved running 'netperf'
between 1 or 2 pairs of VMs on the same machine. I allocated 4 vcpus to dom0
and one to each VM. None of the vcpus were pinned.

- The new design allows you to choose between tasklets and kthreads within
  netback, with tasklets being the default option. Is there any specific
  reason for this?

- The inter-VM performance (throughput) is worse with both tasklets and
  kthreads compared to the old version of netback (as in the
  linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in my
  experiments. Has anyone else observed this? Is the new version yet to be
  optimized?

- Two tasklets (rx and tx) are created per vcpu within netback. But in my
  experiments I noticed that only one vcpu was ever used (even with 4 VMs).
  I also observed that all the event channel notifications within netback
  are always sent to vcpu 0. So my conjecture is that since the tasklets are
  always scheduled by vcpu 0, all of them run only on vcpu 0. Is this a BUG?

- Unlike with tasklets, I observed the CPU utilization go up when I used
  kthreads and increased the number of VMs. But the performance never scaled
  up. On profiling the code (using xenoprof) I observed significant
  synchronization overhead due to lock contention. The main culprit seems to
  be the per-domain lock acquired inside the hypervisor (specifically within
  do_grant_table_op). Further, packets are copied (inside gnttab_copy) while
  this lock is held. Seems like a bad idea?

- A smaller source of overhead is when the '_lock' is acquired within
  netback in netif_idx_release(). Shouldn't this lock be per struct
  xen_netbk instead of being global (declared as static within the
  function)? Is this a BUG?

If some (or all) of these points have already been discussed before, I
apologize in advance!

I appreciate any feedback or pointers.

Thanks.
--Kaushik
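As an illustrative aside, the per-group arrangement described above (one
netback group per dom0 vcpu, each driven either by an rx/tx tasklet pair or
by a kernel thread, with VIFs statically assigned to a group) looks roughly
like the sketch below. All identifiers here are hypothetical stand-ins, not
the actual symbols in xen/stable-2.6.32.x; error handling is omitted.

/*
 * Sketch only: one netback group per dom0 vcpu, each driven either by an
 * rx/tx tasklet pair or by a dedicated kernel thread.
 */
#include <linux/init.h>
#include <linux/interrupt.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/cpumask.h>
#include <linux/list.h>
#include <linux/wait.h>

struct example_netbk_group {
	union {
		struct {
			struct tasklet_struct rx_tasklet;
			struct tasklet_struct tx_tasklet;
		} tasklet;
		struct {
			wait_queue_head_t wq;
			struct task_struct *task;
		} kthread;
	};
	struct list_head vifs;	/* VIFs statically assigned to this group */
};

static int use_kthreads;	/* tasklets are the default in 2.6.32.x */
static unsigned int nr_groups;
static struct example_netbk_group *groups;

static void example_rx_action(unsigned long data) { /* guest-receive work */ }
static void example_tx_action(unsigned long data) { /* guest-transmit work */ }

static int example_netbk_thread(void *data)
{
	struct example_netbk_group *group = data;

	while (!kthread_should_stop()) {
		/* the real thread sleeps here until its group has work */
		wait_event_interruptible(group->kthread.wq,
					 kthread_should_stop());
	}
	return 0;
}

static int __init example_netbk_init(void)
{
	unsigned int i;

	nr_groups = num_online_cpus();	/* one group per dom0 vcpu */
	groups = kcalloc(nr_groups, sizeof(*groups), GFP_KERNEL);
	if (!groups)
		return -ENOMEM;

	for (i = 0; i < nr_groups; i++) {
		INIT_LIST_HEAD(&groups[i].vifs);
		if (use_kthreads) {
			init_waitqueue_head(&groups[i].kthread.wq);
			groups[i].kthread.task =
				kthread_run(example_netbk_thread, &groups[i],
					    "netback/%u", i);
		} else {
			tasklet_init(&groups[i].tasklet.rx_tasklet,
				     example_rx_action,
				     (unsigned long)&groups[i]);
			tasklet_init(&groups[i].tasklet.tx_tasklet,
				     example_tx_action,
				     (unsigned long)&groups[i]);
		}
	}
	return 0;
}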
Konrad Rzeszutek Wilk
2011-May-10 20:23 UTC
Re: [Xen-devel] Xen dom0 network I/O scalability
On Mon, May 09, 2011 at 09:13:09PM -0500, Kaushik Kumar Ram wrote:
> Thanks Konrad. I got hold of a pvops dom0 kernel from Jeremy's git repo
> (xen/stable-2.6.32.x). As you pointed out, it does include those patches.
> [...]
> - The new design allows you to choose between tasklets and kthreads within
>   netback, with tasklets being the default option. Is there any specific
>   reason for this?

Not sure where the thread is for this - but when those patches were posted
they showed a big improvement in performance over 10Gb links. But it did
require spreading the netback work across the CPUs.

> - The inter-VM performance (throughput) is worse with both tasklets and
>   kthreads compared to the old version of netback (as in the
>   linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in my
>   experiments. Has anyone else observed this? Is the new version yet to be
>   optimized?

That is not surprising. The "new" version of netback copies pages. It does
not "swizzle" or "map" them between domains (so no zero copying).

> - Two tasklets (rx and tx) are created per vcpu within netback. But in my
>   experiments I noticed that only one vcpu was ever used (even with 4
>   VMs). I also observed that all the event channel notifications within
>   netback are always sent to vcpu 0. So my conjecture is that since the
>   tasklets are always scheduled by vcpu 0, all of them run only on vcpu 0.
>   Is this a BUG?

Yes. We need to fix 'irqbalance' to work properly. There is something not
working right.

> - Unlike with tasklets, I observed the CPU utilization go up when I used
>   kthreads and increased the number of VMs. But the performance never
>   scaled up. On profiling the code (using xenoprof) I observed significant
>   synchronization overhead due to lock contention. The main culprit seems
>   to be the per-domain lock acquired inside the hypervisor (specifically
>   within do_grant_table_op). Further, packets are copied (inside
>   gnttab_copy) while this lock is held. Seems like a bad idea?

Ian was thinking of reintroducing the zero-copy functionality (he proposed a
talk about it at the Linux Plumbers Conference). But it is not an easy
problem because of the way the pages go through the Linux kernel plumbing.

> - A smaller source of overhead is when the '_lock' is acquired within
>   netback in netif_idx_release(). Shouldn't this lock be per struct
>   xen_netbk instead of being global (declared as static within the
>   function)? Is this a BUG?

Ian, what is your thought?
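For readers following the copy vs. "swizzle"/map distinction above: on the
guest-transmit path the backend can either map the granted page into dom0
(zero-copy) or ask Xen to copy the data. Below is a rough sketch of the two
grant operations involved. The struct gnttab_map_grant_ref, struct
gnttab_copy and GNTTABOP_* definitions are the real public grant-table
interface; the surrounding helper names and the exact way netback drives
them are illustrative only.

/*
 * Rough sketch of the two alternatives on the guest-transmit path.
 * Helper names are hypothetical; the grant-table structures and commands
 * come from xen/interface/grant_table.h.
 */
#include <linux/errno.h>
#include <xen/interface/grant_table.h>
#include <asm/xen/hypercall.h>

/* Zero-copy: map the frontend's granted page into dom0's address space. */
static int example_map_tx_page(unsigned long host_addr, grant_ref_t ref,
			       domid_t otherend)
{
	struct gnttab_map_grant_ref map = {
		.host_addr = host_addr,
		.flags     = GNTMAP_host_map,
		.ref       = ref,
		.dom       = otherend,
	};

	if (HYPERVISOR_grant_table_op(GNTTABOP_map_grant_ref, &map, 1))
		return -EFAULT;
	return map.status != GNTST_okay ? -EIO : 0;
}

/* Copying: have Xen copy the granted data into a local dom0 page instead. */
static int example_copy_tx_data(unsigned long local_gmfn, grant_ref_t ref,
				domid_t otherend, uint16_t len)
{
	struct gnttab_copy copy = {
		.source.u.ref  = ref,
		.source.domid  = otherend,
		.source.offset = 0,
		.dest.u.gmfn   = local_gmfn,
		.dest.domid    = DOMID_SELF,
		.dest.offset   = 0,
		.len           = len,
		.flags         = GNTCOPY_source_gref,
	};

	if (HYPERVISOR_grant_table_op(GNTTABOP_copy, &copy, 1))
		return -EFAULT;
	return copy.status != GNTST_okay ? -EIO : 0;
}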
On Tue, 2011-05-10 at 21:23 +0100, Konrad Rzeszutek Wilk wrote:
> On Mon, May 09, 2011 at 09:13:09PM -0500, Kaushik Kumar Ram wrote:
> > - The new design allows you to choose between tasklets and kthreads
> >   within netback, with tasklets being the default option. Is there any
> >   specific reason for this?
>
> Not sure where the thread is for this - but when those patches were posted
> they showed a big improvement in performance over 10Gb links. But it did
> require spreading the netback work across the CPUs.

Using a tasklet basically allows netback to take the entire CPU under heavy
load (mostly a problem when you only have the same number of VCPUs assigned
to dom0 as you have netback tasklets). Using a thread causes the network
processing to at least get scheduled alongside e.g. your sshd and toolstack.
I seem to remember a small throughput reduction in thread vs. tasklet mode,
but this is more than offset by the "can use dom0" factor. In the upstream
version of netback I removed the tasklet option, so threaded is the only
choice.

The decision to go multi-thread/tasklet was somewhat orthogonal to this and
was about utilising all of the VCPUs in dom0. Previously netback would only
ever use 1 CPU. In the new design each VIF interface is statically assigned
to a particular netback thread at start of day. So for a given VIF interface
there is no real difference other than lower contention with other VIFs.

> > - The inter-VM performance (throughput) is worse with both tasklets and
> >   kthreads compared to the old version of netback (as in the
> >   linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in
> >   my experiments. Has anyone else observed this? Is the new version yet
> >   to be optimized?
>
> That is not surprising. The "new" version of netback copies pages. It does
> not "swizzle" or "map" them between domains (so no zero copying).

I think Kaushik is running a xen/2.6.32.x tree and the copying-only variant
is only in mainline.

A 50% drop in performance between linux-2.6-xen.hg and the xen.git 2.6.32
tree is slightly worrying, but such a big drop sounds more like a
misconfiguration, e.g. something like enabling debugging options in the
kernel .config, rather than a design or implementation issue in netback.

(I actually have no idea what was in the linux-2.6-xen.hg tree; I don't
recall such a tree ever being properly maintained, the last cset appears to
be from 2006 and I recently cleaned it out of xenbits because no one knew
what it was -- did you mean linux-2.6.18-xen.hg?)

> > - Two tasklets (rx and tx) are created per vcpu within netback. But in
> >   my experiments I noticed that only one vcpu was ever used (even with 4
> >   VMs). I also observed that all the event channel notifications within
> >   netback are always sent to vcpu 0. So my conjecture is that since the
> >   tasklets are always scheduled by vcpu 0, all of them run only on
> >   vcpu 0. Is this a BUG?
>
> Yes. We need to fix 'irqbalance' to work properly. There is something not
> working right.

The fix is to install the "irqbalanced" package. Without it no IRQ balancing
will occur in a modern kernel. (Perhaps this linux-2.6-xen.hg tree was from
a time when the kernel would do balancing on its own?) You can also manually
balance the VIF IRQs under /proc/irq if you are so inclined.

> > - A smaller source of overhead is when the '_lock' is acquired within
> >   netback in netif_idx_release(). Shouldn't this lock be per struct
> >   xen_netbk instead of being global (declared as static within the
> >   function)? Is this a BUG?
>
> Ian, what is your thought?

I suspect the _lock could be moved into the netbk; I expect it was just
missed in the switch to multi-threading because it was static in the
function instead of a normal global var located with all the others.

Ian.
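For illustration, the kind of change being suggested for the
netif_idx_release() lock looks roughly like the sketch below: turn the
function-local static spinlock into a per-group member so that two groups no
longer contend on the same lock. The field and function names are simplified
stand-ins for the real netback code.

/*
 * Sketch of the suggested fix: the lock that used to be a function-local
 * "static DEFINE_SPINLOCK(_lock)" shared by every group becomes a member
 * of the per-group structure, so each xen_netbk instance serialises only
 * its own index-release ring.
 */
#include <linux/spinlock.h>
#include <linux/types.h>

#define EXAMPLE_MAX_PENDING 256

struct example_netbk {
	spinlock_t idx_release_lock;	/* was a single global static lock */
	u16 dealloc_ring[EXAMPLE_MAX_PENDING];
	unsigned int dealloc_prod;
};

static void example_netif_idx_release(struct example_netbk *netbk,
				      u16 pending_idx)
{
	unsigned long flags;

	spin_lock_irqsave(&netbk->idx_release_lock, flags);
	netbk->dealloc_ring[netbk->dealloc_prod % EXAMPLE_MAX_PENDING] =
		pending_idx;
	/* publish the entry before advancing the producer index */
	smp_wmb();
	netbk->dealloc_prod++;
	spin_unlock_irqrestore(&netbk->idx_release_lock, flags);
}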
On May 11, 2011, at 4:31 AM, Ian Campbell wrote:
>>> - The inter-VM performance (throughput) is worse with both tasklets and
>>> kthreads compared to the old version of netback (as in the
>>> linux-2.6-xen.hg repo). I observed about a 50% drop in throughput in my
>>> experiments. Has anyone else observed this? Is the new version yet to
>>> be optimized?
>>
>> That is not surprising. The "new" version of netback copies pages. It
>> does not "swizzle" or "map" them between domains (so no zero copying).
>
> I think Kaushik is running a xen/2.6.32.x tree and the copying-only
> variant is only in mainline.
>
> A 50% drop in performance between linux-2.6-xen.hg and the xen.git 2.6.32
> tree is slightly worrying, but such a big drop sounds more like a
> misconfiguration, e.g. something like enabling debugging options in the
> kernel .config, rather than a design or implementation issue in netback.
>
> (I actually have no idea what was in the linux-2.6-xen.hg tree; I don't
> recall such a tree ever being properly maintained, the last cset appears
> to be from 2006 and I recently cleaned it out of xenbits because no one
> knew what it was -- did you mean linux-2.6.18-xen.hg?)

I was referring to the single-threaded netback version in
linux-2.6.18-xen.hg (which, by the way, also uses copying). I don't believe
misconfiguration is the reason. As I mentioned previously, I profiled the
code and found significant synchronization overhead due to lock contention.
This essentially happens when two vcpus in dom0 perform the grant hypercall
and both try to acquire the domain_lock.

I don't think re-introducing zero-copy in the receive path is a solution to
this problem. I mentioned packet copies only to explain the severity of this
problem. Let me try to clarify. Consider the following scenario: vcpu 1
performs a hypercall, acquires the domain_lock, and starts copying one or
more packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it
cannot acquire the domain_lock until all the copies have completed and the
lock is released by vcpu 1. So the domain_lock could be held for a long time
before it is released. I think to properly scale netback we need more
fine-grained locking.

>>> - Two tasklets (rx and tx) are created per vcpu within netback. But in
>>> my experiments I noticed that only one vcpu was ever used (even with 4
>>> VMs). I also observed that all the event channel notifications within
>>> netback are always sent to vcpu 0. So my conjecture is that since the
>>> tasklets are always scheduled by vcpu 0, all of them run only on vcpu 0.
>>> Is this a BUG?
>>
>> Yes. We need to fix 'irqbalance' to work properly. There is something
>> not working right.
>
> The fix is to install the "irqbalanced" package. Without it no IRQ
> balancing will occur in a modern kernel. (Perhaps this linux-2.6-xen.hg
> tree was from a time when the kernel would do balancing on its own?) You
> can also manually balance the VIF IRQs under /proc/irq if you are so
> inclined.

Why can't the virq associated with each xen_netbk be bound to a different
vcpu during initialization? There is, after all, one struct xen_netbk per
vcpu in dom0. This seems like the simplest fix for this problem.

>>> - A smaller source of overhead is when the '_lock' is acquired within
>>> netback in netif_idx_release(). Shouldn't this lock be per struct
>>> xen_netbk instead of being global (declared as static within the
>>> function)? Is this a BUG?
>>
>> Ian, what is your thought?
>
> I suspect the _lock could be moved into the netbk; I expect it was just
> missed in the switch to multi-threading because it was static in the
> function instead of a normal global var located with all the others.

Yes, it has to be moved into struct xen_netbk.

Also, which git repo/branch should I be using if I would like to experiment
with the latest dom0 networking?

--Kaushik
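To make the contention scenario concrete: on the guest-receive path netback
queues many struct gnttab_copy operations and submits them in a single
GNTTABOP_copy hypercall, and the profiling above suggests the whole batch is
processed while the calling domain's lock is held, so a second dom0 vcpu
issuing its own batch ends up waiting. The sketch below shows only the
dom0-side batching; the helper names and batch size are hypothetical, while
the grant-copy structures and hypercall are the real interface.

/*
 * Dom0-side sketch of the batching that feeds the contended path.  Helper
 * names are hypothetical; bounds checking and error reporting per entry
 * are omitted for brevity.
 */
#include <linux/errno.h>
#include <xen/interface/grant_table.h>
#include <asm/xen/hypercall.h>

#define EXAMPLE_BATCH 64

static struct gnttab_copy example_batch[EXAMPLE_BATCH];
static unsigned int example_batch_len;

/* Queue one fragment: copy "len" bytes from a local dom0 page into a page
 * the guest has granted for its receive ring. */
static void example_queue_rx_copy(unsigned long src_gmfn, uint16_t src_offset,
				  grant_ref_t dst_ref, domid_t guest,
				  uint16_t dst_offset, uint16_t len)
{
	struct gnttab_copy *op = &example_batch[example_batch_len++];

	op->source.u.gmfn  = src_gmfn;
	op->source.domid   = DOMID_SELF;
	op->source.offset  = src_offset;
	op->dest.u.ref     = dst_ref;
	op->dest.domid     = guest;
	op->dest.offset    = dst_offset;
	op->len            = len;
	op->flags          = GNTCOPY_dest_gref;
}

/* One hypercall covers the whole batch; per the profiling discussed in the
 * thread, every copy in it is then subject to the per-domain locking inside
 * do_grant_table_op. */
static int example_flush_rx_copies(void)
{
	int ret = 0;

	if (example_batch_len &&
	    HYPERVISOR_grant_table_op(GNTTABOP_copy, example_batch,
				      example_batch_len))
		ret = -EFAULT;
	example_batch_len = 0;
	return ret;
}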
On Wed, 2011-05-11 at 19:43 +0100, Kaushik Kumar Ram wrote:
> I was referring to the single-threaded netback version in
> linux-2.6.18-xen.hg (which, by the way, also uses copying).

Ah, I think we are talking about different kinds of copying.

A long time ago the backend->frontend path (guest receive) operated using a
page-flipping mode. At some point a copying mode was added to this path,
which became the default some time in 2006. You would have to go out of your
way to find a guest which uses flipping mode these days. I think this is the
copying you are referring to; it's so long ago that there was a distinction
on this path that I'd forgotten all about it until now.

The frontend->backend path (guest transmit) has used a mapping (PageForeign)
based scheme practically since forever. However, when netback was upstreamed
into 2.6.39 this had to be removed in favour of a copy-based implementation
(PageForeign has fingers in the mm subsystem which were unacceptable for
upstreaming). This is the copying mode Konrad and I were talking about. We
know the performance will suffer versus mapping mode, and we are working to
find ways of reinstating mapping.

> I don't believe misconfiguration is the reason. As I mentioned previously,
> I profiled the code and found significant synchronization overhead due to
> lock contention. This essentially happens when two vcpus in dom0 perform
> the grant hypercall and both try to acquire the domain_lock.
>
> I don't think re-introducing zero-copy in the receive path is a solution
> to this problem.

As far as I can tell you are running with the zero-copy path. Only mainline
2.6.39+ has anything different.

I think you need to go into detail about your test setup so we can all get
on the same page and stop confusing ourselves by guessing which modes
netback has available and is running in. Please can you describe precisely
which kernels you are running (tree URL and changeset as well as the .config
you are using). Please also describe your guest configuration (kernels, cfg
file, distro etc) and benchmark methodology (e.g. netperf options).

I'd also be interested in seeing the actual numbers you are getting,
alongside specifics of the test scenario which produced them. I'm especially
interested in the details of the experiment(s) where you saw a 50% drop in
throughput.

> I mentioned packet copies only to explain the severity of this problem.
> Let me try to clarify. Consider the following scenario: vcpu 1 performs a
> hypercall, acquires the domain_lock, and starts copying one or more
> packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it
> cannot acquire the domain_lock until all the copies have completed and the
> lock is released by vcpu 1. So the domain_lock could be held for a long
> time before it is released.

But this isn't a difference between the multi-threaded/tasklet and
single-threaded/tasklet version of netback, is it?

In the single-threaded case the serialisation is explicit due to the lack of
threading, and it would obviously be good to avoid it in the multi-threaded
case, but the contention doesn't really explain why multi-threaded mode
would be 50% slower. (I suppose the threading case could serialise things
into a different order, perhaps one which is somehow pessimal for e.g. TCP.)

It is quite easy to force the number of tasklets/threads to 1 (by forcing
xen_netbk_group_nr to 1 in netback_init()). This might be an interesting
experiment to see if the degradation is down to contention between threads
or something else which has changed between 2.6.18 and 2.6.32 (there is an
extent to which this is comparing apples to oranges, but 50% is pretty
severe...).

> I think to properly scale netback we need more fine-grained locking.

Quite possibly. It doesn't seem at all unlikely that the domain lock on the
guest-receive grant copy is going to hurt at some point. There are some
plans to rework the guest receive path to do the copy on the guest side; the
primary motivation is to remove load from dom0 and to allow better
accounting of work to the guests that request it, but a side effect of this
could be to reduce contention on dom0's domain_lock.

However, I would like to get to the bottom of the 50% degradation between
linux-2.6.18-xen.hg and xen.git#xen/stable-2.6.32.x before we move on to how
we can further improve the situation in xen.git.

> Why can't the virq associated with each xen_netbk be bound to a different
> vcpu during initialization?

An IRQ is associated with a VIF and multiple VIFs can be associated with a
netbk.

I suppose we could bind the IRQ to the same CPU as the associated netbk
thread, but this can move around so we'd need to follow it. The tasklet case
is easier since, I think, the tasklet will be run on whichever CPU scheduled
it, which will be the one the IRQ occurred on.

Drivers are not typically expected to behave in this way. In fact I'm not
sure it is even allowed by the IRQ subsystem, and I expect upstream would
frown on a driver doing this sort of thing (I expect their answer would be
"why aren't you using irqbalanced?"). If you can make this work and it shows
real gains over running irqbalanced we can of course consider it.

[...]

> Also, which git repo/branch should I be using if I would like to
> experiment with the latest dom0 networking?

I wouldn't recommend playing with the stuff in mainline right now -- we know
it isn't the best due to the use of copying on the guest receive path. The
xen.git#xen/stable-2.6.32.x tree is probably the best one to experiment on.

Ian.
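For concreteness, what is being proposed amounts to something like the
sketch below: when a VIF's event-channel IRQ is set up, pin it to the CPU of
the netback group it was assigned to. irq_set_affinity() and cpumask_of()
are real interfaces in the 2.6.32-era kernels being discussed; the
surrounding names are hypothetical, and as noted above this is not how
drivers are normally expected to behave, so treat it as an experiment rather
than a recommendation.

/*
 * Experimental sketch of the proposal: spread each group's event-channel
 * IRQ across the dom0 vcpus at setup time instead of relying on
 * irqbalanced.  Surrounding names are hypothetical.
 */
#include <linux/kernel.h>
#include <linux/interrupt.h>
#include <linux/cpumask.h>

/* Called once per VIF after something like
 * bind_interdomain_evtchn_to_irqhandler() has returned the Linux IRQ
 * number for its event channel. */
static void example_pin_vif_irq(unsigned int irq, unsigned int group_nr)
{
	/* Put group N's interrupts on CPU N (mod the number of CPUs), so
	 * notifications and the group's tasklet/thread tend to run on the
	 * same vcpu. */
	unsigned int cpu = group_nr % num_online_cpus();

	if (irq_set_affinity(irq, cpumask_of(cpu)))
		pr_warning("netback: could not pin irq %u to cpu %u\n",
			   irq, cpu);
}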
On May 12, 2011, at 3:21 AM, Ian Campbell wrote:
> A long time ago the backend->frontend path (guest receive) operated using
> a page-flipping mode. At some point a copying mode was added to this path,
> which became the default some time in 2006. You would have to go out of
> your way to find a guest which uses flipping mode these days. I think this
> is the copying you are referring to; it's so long ago that there was a
> distinction on this path that I'd forgotten all about it until now.

I was not referring to page flipping at all. I was only talking about the
copies in the receive path.

> The frontend->backend path (guest transmit) has used a mapping
> (PageForeign) based scheme practically since forever. However, when
> netback was upstreamed into 2.6.39 this had to be removed in favour of a
> copy-based implementation (PageForeign has fingers in the mm subsystem
> which were unacceptable for upstreaming). This is the copying mode Konrad
> and I were talking about. We know the performance will suffer versus
> mapping mode, and we are working to find ways of reinstating mapping.

Hmm.. I did not know that copying had been introduced in the transmit path.
But as I said above, I was only referring to the receive path.

> As far as I can tell you are running with the zero-copy path. Only
> mainline 2.6.39+ has anything different.

Again, I was only referring to the receive path! I assumed you were talking
about re-introducing zero-copy in the receive path (aka page flipping).

To be clear:
- xen.git#xen/stable-2.6.32.x uses copying in the RX path and mapping
  (zero-copy) in the RX path.
- Copying is used in both the RX and TX paths in 2.6.39+ for upstreaming.

> I think you need to go into detail about your test setup so we can all get
> on the same page and stop confusing ourselves by guessing which modes
> netback has available and is running in. Please can you describe precisely
> which kernels you are running (tree URL and changeset as well as the
> .config you are using). Please also describe your guest configuration
> (kernels, cfg file, distro etc) and benchmark methodology (e.g. netperf
> options).
>
> I'd also be interested in seeing the actual numbers you are getting,
> alongside specifics of the test scenario which produced them. I'm
> especially interested in the details of the experiment(s) where you saw a
> 50% drop in throughput.

I agree. I plan to run the experiments again next week. I will get back to
you with all the details. But these are the versions I am trying to compare:

1. http://xenbits.xensource.com/linux-2.6.18-xen.hg (single-threaded legacy
   netback)
2. xen.git#xen/stable-2.6.32.x (multi-threaded netback using tasklets)
3. xen.git#xen/stable-2.6.32.x (multi-threaded netback using kthreads)

And (1) outperforms both (2) and (3).

>> I mentioned packet copies only to explain the severity of this problem.
>> Let me try to clarify. Consider the following scenario: vcpu 1 performs a
>> hypercall, acquires the domain_lock, and starts copying one or more
>> packets (in gnttab_copy). Now vcpu 2 also performs a hypercall, but it
>> cannot acquire the domain_lock until all the copies have completed and
>> the lock is released by vcpu 1. So the domain_lock could be held for a
>> long time before it is released.
>
> But this isn't a difference between the multi-threaded/tasklet and
> single-threaded/tasklet version of netback, is it?
>
> In the single-threaded case the serialisation is explicit due to the lack
> of threading, and it would obviously be good to avoid it in the
> multi-threaded case, but the contention doesn't really explain why
> multi-threaded mode would be 50% slower. (I suppose the threading case
> could serialise things into a different order, perhaps one which is
> somehow pessimal for e.g. TCP.)
>
> It is quite easy to force the number of tasklets/threads to 1 (by forcing
> xen_netbk_group_nr to 1 in netback_init()). This might be an interesting
> experiment to see if the degradation is down to contention between threads
> or something else which has changed between 2.6.18 and 2.6.32 (there is an
> extent to which this is comparing apples to oranges, but 50% is pretty
> severe...).

Hmm.. You are right. I will run the above experiments next week.

>> I think to properly scale netback we need more fine-grained locking.
>
> Quite possibly. It doesn't seem at all unlikely that the domain lock on
> the guest-receive grant copy is going to hurt at some point. There are
> some plans to rework the guest receive path to do the copy on the guest
> side; the primary motivation is to remove load from dom0 and to allow
> better accounting of work to the guests that request it, but a side effect
> of this could be to reduce contention on dom0's domain_lock.
>
> However, I would like to get to the bottom of the 50% degradation between
> linux-2.6.18-xen.hg and xen.git#xen/stable-2.6.32.x before we move on to
> how we can further improve the situation in xen.git.

OK.

> An IRQ is associated with a VIF and multiple VIFs can be associated with a
> netbk.
>
> I suppose we could bind the IRQ to the same CPU as the associated netbk
> thread, but this can move around so we'd need to follow it. The tasklet
> case is easier since, I think, the tasklet will be run on whichever CPU
> scheduled it, which will be the one the IRQ occurred on.
>
> Drivers are not typically expected to behave in this way. In fact I'm not
> sure it is even allowed by the IRQ subsystem, and I expect upstream would
> frown on a driver doing this sort of thing (I expect their answer would be
> "why aren't you using irqbalanced?"). If you can make this work and it
> shows real gains over running irqbalanced we can of course consider it.

OK.

Thanks.
--Kaushik
On Thu, 2011-05-12 at 21:10 +0100, Kaushik Kumar Ram wrote:
> On May 12, 2011, at 3:21 AM, Ian Campbell wrote:
> > A long time ago the backend->frontend path (guest receive) operated
> > using a page-flipping mode. At some point a copying mode was added to
> > this path, which became the default some time in 2006. You would have to
> > go out of your way to find a guest which uses flipping mode these days.
> > I think this is the copying you are referring to; it's so long ago that
> > there was a distinction on this path that I'd forgotten all about it
> > until now.
>
> I was not referring to page flipping at all. I was only talking about the
> copies in the receive path.

Sure, I was just explaining the background as to why everyone got confused
when you mentioned copying mode, since most people's minds went to the other
copying vs. !copying distinction.

> > As far as I can tell you are running with the zero-copy path. Only
> > mainline 2.6.39+ has anything different.
>
> Again, I was only referring to the receive path! I assumed you were
> talking about re-introducing zero-copy in the receive path (aka page
> flipping).
>
> To be clear:
> - xen.git#xen/stable-2.6.32.x uses copying in the RX path and mapping
>   (zero-copy) in the RX path.
                       ^TX

> - Copying is used in both the RX and TX paths in 2.6.39+ for upstreaming.

Correct (as amended).

For completeness, linux-2.6.18-xen.hg has both copying and flipping RX paths
but defaults to copying (you'd be hard pressed to find a guest which uses
flipping mode). The TX path is mapping (zero-copy). In
xen.git#xen/stable-2.6.32.x the code to support the flipping RX path has
been completely removed.

> I agree. I plan to run the experiments again next week. I will get back to
> you with all the details.

Thanks.

Ian.