Christophe Saout
2010-Aug-22 16:43 UTC
[Xen-devel] new netfront and occasional receive path lockup
Hi,

I've been playing with some of the new pvops code, namely DomU guest code. What I've been observing on one of the virtual machines is that the network (vif) is dying after about ten to sixty minutes of uptime. The unfortunate thing here is that I can only reproduce it on a production VM and have been unlucky so far to trigger the bug on a test machine. While this has not been tragic - rebooting fixed the issue - unfortunately I can't spend very much time on debugging after the issue pops up.

Now, what is happening is that the receive path goes dead. The DomU can send packets to Dom0 and those are visible using tcpdump on the Dom0 on the virtual interface, but not the other way around.

Now, I have done more than one change at a time (I'd like to avoid going into pinning it down since I can only reproduce it on a production machine, as I said, so suggestions are welcome), but my suspicion is that it might have to do with the new "smart polling" feature in xen/netfront. Note that I have also updated Dom0 to pull in the latest dom0/backend and netback changes, just to make sure it's not due to an issue that has been fixed there, but I'm still seeing the same.

The production machine is a machine that doesn't have much network load, but deals with a lot of small network requests (DNS and smtp mostly) - a workload which is hard to reproduce on the test machine. Heavy network load (NFS, FTP and so on) for days hasn't triggered the problem. Also, segmentation offloading and similar settings don't have any effect.

The machine has 2 physical and the VM 2 virtual CPUs; DomU has PREEMPT enabled.

I've been looking at the code to see if there might be a race condition somewhere - something like a situation where the hrtimer doesn't run and Dom0 believes the DomU should be polling and doesn't emit an interrupt - but I'm afraid I don't know enough to judge this (I mean, there are spinlocks which look safe to me).

Do you have any suggestions what to try? I can trigger the issue on the production VM again, but debugging should not take more than a few minutes if it happens. Access is only possible via the console. Neither Dom0 nor the guest show anything unusual in the kernel messages, and both continue to behave normally after the network goes dead (I am also able to shut down the guest normally).

Thanks,
Christophe
Christophe Saout
2010-Aug-22 18:37 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi again,

> I've been looking at the code to see if there might be a race condition
> somewhere - something like a situation where the hrtimer doesn't run
> and Dom0 believes the DomU should be polling and doesn't emit an
> interrupt - but I'm afraid I don't know enough to judge this (I mean,
> there are spinlocks which look safe to me).

Hmm, looking a bit more.

rx.sring->private.netif.smartpoll_active lies in a piece of memory that is shared between netback and netfront, is that right?

If that is so, the tx spinlock in netfront only protects against simultaneous modifications from another thread in netfront, so netback can read smartpoll_active while netfront is fiddling with it. Is that safe?

Note that when the lockup occurs, /proc/interrupts in the guest doesn't show any interrupts arriving for eth0 anymore. Are there any conditions where netback waits for netfront to retrieve packets even when new packets arrive? (like e.g. when the ring is full and there is backlog into the network stack or something?) Any way to debug this from the Dom0 side? Like looking into the state of the ring from userspace? Debug options?

Christophe
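For readers following along, the concern can be pictured with a minimal sketch. The field names follow the discussion above, but the struct layout, types and lock usage are illustrative simplifications, not the real driver code:

    /* Sketch only: simplified from the description in this thread. */
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct rx_sring_sketch {
            struct {
                    struct {
                            uint8_t smartpoll_active;  /* lives in the shared page */
                    } netif;
            } private;
            /* ... ring entries, producer/consumer indices ... */
    };

    struct netfront_sketch {
            spinlock_t tx_lock;                /* domain-local lock */
            struct rx_sring_sketch *rx_sring;  /* page mapped by both domains */
    };

    /* Holding tx_lock serialises netfront's own CPUs, but netback runs in
     * another domain and never takes this lock, so it can observe
     * rx_sring->private.netif.smartpoll_active at any point during an
     * update unless the accesses to the shared page themselves are ordered. */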
Christophe Saout
2010-Aug-23 14:26 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi yet again,

[not quoting everything again]

I finally managed to trigger the issue on the test VM, which has been stuck in that state since last night and can be inspected. Apparently the tx ring on the netback side is full, since every packet sent is immediately dropped (as seen from ifconfig output). No interrupts moving on the guest.

Still I'm wondering what would be the best course of action trying to debug this now. Should I have compiled some debugger into the hypervisor? (gdbsx apparently needs that)

Thanks,
Christophe
Konrad Rzeszutek Wilk
2010-Aug-23 16:04 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On Mon, Aug 23, 2010 at 04:26:52PM +0200, Christophe Saout wrote:
> Hi yet again,
>
> [not quoting everything again]
>
> I finally managed to trigger the issue on the test VM, which has been
> stuck in that state since last night and can be inspected. Apparently
> the tx ring on the netback side is full, since every packet sent is
> immediately dropped (as seen from ifconfig output). No interrupts
> moving on the guest.

What is the kernel and hypervisor in Dom0? And what is it in DomU?

> Still I'm wondering what would be the best course of action trying to
> debug this now. Should I have compiled some debugger into the
> hypervisor? (gdbsx apparently needs that)

Sure. An easier path might be to do 'xm debug-keys q', which should trigger the debug irq handler. In DomU that should print out all of the event channel bits, which we can analyze to see if the proper bits are not set (and hence the IRQ handler isn't picking up from the ring buffer).
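A minimal example of the suggested flow might look like the following. Only 'xm debug-keys q' comes from the message above; where exactly the output lands (hypervisor console vs. the guest's own kernel log) is an assumption about the usual behaviour of the debug key:

    # on Dom0
    $ xm debug-keys q        # ask Xen to trigger the 'q' debug key handler
    $ xm dmesg | tail -n 80  # hypervisor console output, if any

    # inside the DomU
    $ dmesg | tail -n 80     # the guest's debug IRQ handler prints its event channel state here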
Christophe Saout
2010-Aug-23 17:09 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi Konrad,

> > I finally managed to trigger the issue on the test VM, which has been
> > stuck in that state since last night and can be inspected. Apparently
> > the tx ring on the netback side is full, since every packet sent is
> > immediately dropped (as seen from ifconfig output). No interrupts
> > moving on the guest.
>
> What is the kernel and hypervisor in Dom0? And what is it in DomU?

The hypervisor is from the Xen 4.0.0 release and the Dom0 is from Jeremy's 2.6.32 stable branch for pvops Dom0 (and lately with the xen/dom0/backend branches merged in on top, because I hoped there might be some fixes that help). The same kernel has been working fine as a guest, but my newer one, where I took an upstream 2.6.35, applied some of the upstream fixes branches and also pulled xen/netfront in, is now causing this issue. Everything else is working just fine, so I am pretty sure it is related to a netfront-specific change and not to anything else.

> > hypervisor? (gdbsx apparently needs that)
>
> Sure.

Also, I noticed that "gdb /path/to/vmlinux /proc/kcore" does allow me to inspect the memory. I'll try to see if I can pinpoint some of the interesting memory locations.

> An easier path might be to do 'xm debug-keys q', which should
> trigger the debug irq handler. In DomU that should print out all of the
> event channel bits, which we can analyze to see if the
> proper bits are not set (and hence the IRQ handler isn't picking up
> from the ring buffer).

I'm not exactly sure how to read the output of that.

http://www.saout.de/assets/xm-debug-q.txt

Christophe
Jeremy Fitzhardinge
2010-Aug-24 00:46 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 08/22/2010 09:43 AM, Christophe Saout wrote:
> Hi,
>
> I've been playing with some of the new pvops code, namely DomU guest
> code. What I've been observing on one of the virtual machines is that
> the network (vif) is dying after about ten to sixty minutes of uptime.
> The unfortunate thing here is that I can only reproduce it on a
> production VM and have been unlucky so far to trigger the bug on a test
> machine. While this has not been tragic - rebooting fixed the issue -
> unfortunately I can't spend very much time on debugging after the issue
> pops up.

Ah, OK. I've seen this a couple of times as well. And it just happened to me then...

> Now, what is happening is that the receive path goes dead. The DomU can
> send packets to Dom0 and those are visible using tcpdump on the Dom0 on
> the virtual interface, but not the other way around.

I hadn't got to that level of diagnosis, but I can confirm that that's what seems to be happening here too.

> Now, I have done more than one change at a time (I'd like to avoid going
> into pinning it down since I can only reproduce it on a production
> machine, as I said, so suggestions are welcome), but my suspicion is
> that it might have to do with the new "smart polling" feature in
> xen/netfront. Note that I have also updated Dom0 to pull in the latest
> dom0/backend and netback changes, just to make sure it's not due to an
> issue that has been fixed there, but I'm still seeing the same.

I agree. I think I started seeing this once I merged smartpoll into netfront.

J

[...]
Jeremy Fitzhardinge
2010-Aug-24 00:53 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 08/22/2010 11:37 AM, Christophe Saout wrote:
> Hmm, looking a bit more.
>
> rx.sring->private.netif.smartpoll_active lies in a piece of memory that
> is shared between netback and netfront, is that right?
>
> If that is so, the tx spinlock in netfront only protects against
> simultaneous modifications from another thread in netfront, so netback
> can read smartpoll_active while netfront is fiddling with it. Is that
> safe?

It depends on exactly how it is used. But any use of cross-cpu shared memory must carefully consider access ordering, and possibly have explicit barriers to make sure that the expected ordering is actually seen by all cpus.

J

[...]
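To make the ordering point concrete, here is a generic publish/consume sketch for a flag kept in shared memory. The function names and the exact barrier choice are assumptions made for illustration; this is not the netfront/netback code:

    /* Sketch only: assumes the usual kernel headers providing wmb()/rmb(). */

    /* writer side (e.g. the frontend) */
    static void publish_flag(volatile uint8_t *flag, uint8_t value)
    {
            /* make earlier ring updates visible before the flag changes */
            wmb();
            *flag = value;
    }

    /* reader side (e.g. the backend) */
    static uint8_t read_flag(const volatile uint8_t *flag)
    {
            uint8_t value = *flag;
            /* do not act on ring contents read speculatively before the flag */
            rmb();
            return value;
    }

    /* Without such pairing, the two sides may disagree about whether the
     * flag or the data it guards became visible first. */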
Xu, Dongxiao
2010-Aug-25 00:51 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Hi Christophe,

Thanks for finding and checking the problem. I will try to reproduce the issue and check what caused the problem.

Thanks,
Dongxiao

Jeremy Fitzhardinge wrote:
> On 08/22/2010 09:43 AM, Christophe Saout wrote:
[...]
Pasi Kärkkäinen
2010-Sep-09 18:50 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
> Hi Christophe,
>
> Thanks for finding and checking the problem.
> I will try to reproduce the issue and check what caused the problem.
>

Hello,

Was this issue resolved? Some users have been complaining about "network freezing up" issues recently on ##xen on irc..

-- Pasi

> Thanks,
> Dongxiao
>
> Jeremy Fitzhardinge wrote:
[...]
Jeremy Fitzhardinge
2010-Sep-10 00:55 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>> Hi Christophe,
>>
>> Thanks for finding and checking the problem.
>> I will try to reproduce the issue and check what caused the problem.
>>
> Hello,
>
> Was this issue resolved? Some users have been complaining about
> "network freezing up" issues recently on ##xen on irc..

Yeah, I'll add a command-line parameter to disable smartpoll (and leave it off by default).

J

[...]
Xu, Dongxiao
2010-Sep-10 01:45 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Hi Jeremy and Pasi,

I was frustrated that I couldn't reproduce this bug at my site. However, I investigated the code, and indeed there is one race condition that probably causes the bug. See the attached patch.

Could anybody who can see this bug help to try it? Appreciate it much!

Thanks,
Dongxiao

Jeremy Fitzhardinge wrote:
> On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
[...]
Jeremy Fitzhardinge
2010-Sep-10 02:25 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.

Perhaps you have been trying to reproduce it in the wrong conditions? I have generally seen this bug when the networking is under very light load, such as a couple of fairly idle dom0<->domU ssh connections. I'm not sure that I've seen it under heavy load.

> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug. See the attached patch.
>
> Could anybody who can see this bug help to try it? Appreciate it much!

Thanks for looking into this. Your logic seems reasonable, so I'll apply it (however I also added a patch to make smartpoll default to "off"; I guess I can switch that to default on again to make sure it gets tested, but leave the option as a workaround if there are still problems).

However, I am concerned about these manipulations of a cross-cpu shared variable without any barriers or other ordering constraints. Are you sure this code is correct under any reordering (either by the compiler or CPUs), and if the compiler decides to access it more or less often than the source says it should?

Thanks,
J

[...]
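A small sketch of the compiler side of that concern (the function and parameter names are made up for illustration; ACCESS_ONCE() is the stock kernel helper from <linux/compiler.h>):

    #include <linux/compiler.h>
    #include <linux/types.h>

    /* Illustrative only: forces exactly one load of a location in shared
     * memory, so the compiler cannot re-read it later or reuse a value
     * cached from before the surrounding logic ran. */
    static int frontend_should_poll(const uint8_t *smartpoll_active)
    {
            uint8_t active = ACCESS_ONCE(*smartpoll_active);

            return active != 0;
    }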
Xu, Dongxiao
2010-Sep-10 02:37 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Jeremy Fitzhardinge wrote:
> On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>>
>> I was frustrated that I couldn't reproduce this bug at my site.
>
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections. I'm not sure that I've seen it under heavy load.
>
>> However, I investigated the code, and indeed there is one race condition
>> that probably causes the bug. See the attached patch.
>>
>> Could anybody who can see this bug help to try it? Appreciate it much!
>
> Thanks for looking into this. Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).
>
> However, I am concerned about these manipulations of a cross-cpu
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs), and if the compiler decides to access it more or
> less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"? It is a flag in the shared ring structure, therefore operations on this flag are handled the same as the other components of the shared ring, such as being done under the spinlock, etc.

I will keep dom0 and domU ssh'ed for some time to see if the bug still exists.

Thanks,
Dongxiao

[...]
Jeremy Fitzhardinge
2010-Sep-10 02:42 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 12:37 PM, Xu, Dongxiao wrote:
>> However, I am concerned about these manipulations of a cross-cpu
>> shared variable without any barriers or other ordering constraints.
>> Are you sure this code is correct under any reordering (either by the
>> compiler or CPUs), and if the compiler decides to access it more or
>> less often than the source says it should?
> Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
> It is a flag in the shared ring structure, therefore operations on
> this flag are handled the same as the other components of the shared
> ring, such as being done under the spinlock, etc.

Spinlocks are no use for inter-domain synchronization, only within a domain. The other ring operations are carefully ordered with appropriate memory barriers in specific places; that's why I'm a bit concerned about their absence for the smartpoll_active flag. Even if they are not necessary, I'd like to see an analysis as to why.

J
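The ordering referred to here is the pattern the shared-ring macros use. Roughly, paraphrased from memory rather than copied verbatim from xen/interface/io/ring.h, the producer-index update looks like the sketch below, and it is exactly this kind of pairing that the smartpoll_active flag lacks:

    /* Paraphrased sketch of the shared-ring "push" ordering. */
    #define PUSH_RESPONSES_SKETCH(_r) do {                              \
            wmb();  /* entries visible before the producer index moves */ \
            (_r)->sring->rsp_prod = (_r)->rsp_prod_pvt;                 \
    } while (0)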
Gerald Turner
2010-Sep-12 01:00 UTC
[Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.
> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug. See the attached patch.
>
> Could anybody who can see this bug help to try it? Appreciate it much!
>

Hello, I experienced this problem with netfront and the smartpoll code causing the domU bridge interfaces to fail.

I've been building a Xen server using Debian Squeeze, Xen 4.0.1-rc6. For weeks the server had been running solid with just three domUs. In the last few days I significantly increased the number of domUs (13 total) and have been having terrible packet drop problems.

Randomly, maybe after 10 to 60 minutes of uptime, a domU or two will fall victim to bridge failure. There's no syslog/dmesg output. The only report of the problem can be seen through network stats on dom0 (the domU vifX.X interfaces have huge TX drops), and 'brctl showmacs' output is missing the MAC addresses for the domUs that have failed.

I'm not doing anything interesting with networking: eth0/peth0 on dom0 with static IP, vifX.0 on domU, no DHCP, no firewall rules (other than fail2ban), static IP assigned within each domU. I'm using PV, and dom0 and all domUs run the Debian -xen-amd64 flavor kernel (no interest in HVM).

I've tried dozens of attempts to solve this:

* Screwed with ethtool -K XXX tx off on dom0, domU, physical interface.
* Removed the 'network-bridge' setup from xend and set up 'br0' the Debian Way.
* Commented out 'iptables_setup' from the 'vif-bridge' script, which was producing lots of iptables noise.
* Used 'mac=' in the domU vif config.
* Tried the latest vanilla 2.6.35.5 kernel (its netfront driver is pre-smartpoll) - I didn't give this kernel enough time to break, I saw TX drops on boot and assumed the problem was still there, but my judgement was incorrect - all domUs get a few TX drops while the kernel boots (probably ARPs while vifX.X is up but before the domU ifups its eth0 on boot).

Friday morning a fellow named 'Nrg_' on ##xen immediately diagnosed this as possibly being related to the smartpoll bug in the netfront driver.

I examined the Debian linux-image-2.6.32-5-xen-amd64 package and confirmed the netfront driver is patched with an earlier version of the smartpoll code.

I manually merged Debian's kernel with Jeremy's updates to the netfront driver in his git repository.

$ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606

Deployed this new image on all domUs (except for two of them, as a control group) and updated the grub kernel parameter with xen_netfront.use_smartpoll=0.

Problem solved. Only the two domUs I left unpatched get victimized. The rest of the hosts have been up for over a day and have not lost any packets.

P.S. this is my first NNTP post thru gmane, I have no idea if it will reach the list, keep Message-Id/References intact, and CC Christophe, Jeremy, Dongxiao et al.

> Jeremy Fitzhardinge wrote:
[...]

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
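For anyone wanting to replicate the workaround described above, a Debian-style way of setting that parameter might look like the following. Only the xen_netfront.use_smartpoll=0 parameter itself comes from the report; the file path and the update-grub step are assumptions about a typical grub2 setup:

    # in /etc/default/grub on the domU (illustrative)
    GRUB_CMDLINE_LINUX="xen_netfront.use_smartpoll=0"

    # then regenerate the grub config and reboot the guest
    $ update-grub
    $ reboot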
Jeremy Fitzhardinge
2010-Sep-12 08:55 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of the
> smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
> $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> Deployed this new image on all domUs (except for two of them, as a
> control group) and updated the grub kernel parameter with
> xen_netfront.use_smartpoll=0.

That's good to hear. But I also included a fix from Dongxiao which, if correct, means it should work with use_smartpoll=1 (or nothing, as that's the default). Could you verify whether the fix in cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

> Problem solved. Only the two domUs I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. this is my first NNTP post thru gmane, I have no idea if it will
> reach the list, keep Message-Id/References intact, and CC Christophe,
> Jeremy, Dongxiao et al.

There were no cc:s.

Thanks,
J

[...]
Pasi Kärkkäinen
2010-Sep-12 17:23 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Sun, Sep 12, 2010 at 06:55:48PM +1000, Jeremy Fitzhardinge wrote:
> On 09/12/2010 11:00 AM, Gerald Turner wrote:
> > I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> > confirmed the netfront driver is patched with an earlier version of
> > the smartpoll code.
> >
> > I manually merged Debian's kernel with Jeremy's updates to the netfront
> > driver in his git repository.
> >
> > $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
> >
> > Deployed this new image on all domUs (except for two of them, as a
> > control group) and updated the grub kernel parameter with
> > xen_netfront.use_smartpoll=0.
>
> That's good to hear. But I also included a fix from Dongxiao which, if
> correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default). Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>

It'd be good to get the fix(es) into xen/stable-2.6.32.x as well.. Or can you use "use_smartpoll=0" in the current xen/stable-2.6.32.x branch?

-- Pasi
Gerald Turner
2010-Sep-12 22:40 UTC
[Xen-devel] Re: new netfront and occasional receive path lockup
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> On 09/12/2010 11:00 AM, Gerald Turner wrote:
>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>> confirmed the netfront driver is patched with an earlier version of
>> the smartpoll code.
>>
>> I manually merged Debian's kernel with Jeremy's updates to the
>> netfront driver in his git repository.
>>
>> $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>
>> Deployed this new image on all domUs (except for two of them, as a
>> control group) and updated the grub kernel parameter with
>> xen_netfront.use_smartpoll=0.
>
> That's good to hear. But I also included a fix from Dongxiao which,
> if correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default). Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

I've been running with use_smartpoll=1 for a few hours this afternoon; it looks like Dongxiao's bugfix works.

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
Gerald Turner
2010-Sep-13 00:03 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner <gturner@unzane.com> writes:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>> On 09/12/2010 11:00 AM, Gerald Turner wrote:
[...]
>> That's good to hear. But I also included a fix from Dongxiao which,
>> if correct, means it should work with use_smartpoll=1 (or nothing, as
>> that's the default). Could you verify whether the fix in
>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>
> I've been running with use_smartpoll=1 for a few hours this afternoon;
> it looks like Dongxiao's bugfix works.

I spoke too soon! With use_smartpoll set to 1 the problem still shows up: a few domUs lost network after about 60 minutes of uptime. Sorry for the bad news...

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
Xu, Dongxiao
2010-Sep-13 00:54 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
>> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>>
>>> On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>>> confirmed the netfront driver is patched with an earlier version
>>>> the smartpoll code.
>>>>
>>>> I manually merged Debian's kernel with Jeremy's updates to the
>>>> netfront driver in his git repository.
>>>>
>>>> $ git diff
>>>> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c
>>>> 8475f0c00e0606
>>>>
>>>> Deployed this new image on all domU's (except for two of them, as a
>>>> control group) and updated grub kernel parameter with
>>>> xen_netfront.use_smartpoll=0.
>>>
>>> That's good to hear. But I also included a fix from Dongxiao which,
>>> if correct, means it should work with use_smartpoll=1 (or nothing,
>>> as that's the default). Could you verify whether the fix in
>>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>>
>>
>> I've been running with use_smartpoll=1 for a few hours this
>> afternoon, looks like Dongxiao's bugfix works.
>>
>
> I spoke too soon! use_smartpoll set to 1 and still exhibits the
> problem, a few domU's lost network after about 60 minutes of uptime.
> Sorry for the bad news...

Hi Gerald,

Sorry for the inconvenience. I will continue to look into it.

Does this bug only happen when you launch multiple domUs? I tried a
single domU and could not catch the bug.

Thanks,
Dongxiao
Gerald Turner
2010-Sep-13 02:12 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:> Does this bug only happen when you launch multiple domUs? I tried a > single domU and could not catch the bug. >I''ve been working on this server for about two weeks, I hadn''t noticed the problem for the first week when I only had 3 domUs. It started happening when I added 10 more domUs. The problem would happen quickly, within 10 minutes, always affecting at least two domUs at random, and affect more domUs over time. Saturday I installed the updated driver with Jeremy''s use_smartpoll parameter, ran for 24 hours with smartpoll disabled, no problems. Today I''ve been trying with smartpoll enabled. It took an hour to affect two domUs - noticibly longer than the behavior previous days before installing your patch. I still have 9 other domUs running with smartpoll enabled, four hours uptime, I''m surprised they haven''t been affected yet. Could there be another less-frequent race in smart_poll_function? -- Gerald Turner Email: gturner@unzane.com JID: gturner@jabber.unzane.com GPG: 0xFA8CD6D5 21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Xu, Dongxiao
2010-Sep-13 02:34 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Does this bug only happen when you launch multiple domUs? I tried a
>> single domU and could not catch the bug.
>>
>
> I've been working on this server for about two weeks; I hadn't
> noticed the problem for the first week when I only had 3 domUs. It
> started happening when I added 10 more domUs. The problem would
> happen quickly, within 10 minutes, always affecting at least two
> domUs at random, and affecting more domUs over time.
>
> Saturday I installed the updated driver with Jeremy's use_smartpoll
> parameter, ran for 24 hours with smartpoll disabled, no problems.
>
> Today I've been trying with smartpoll enabled. It took an hour to
> affect two domUs - noticeably longer than the behavior on previous
> days, before installing your patch. I still have 9 other domUs running
> with smartpoll enabled, four hours uptime; I'm surprised they haven't
> been affected yet. Could there be another less-frequent race in
> smart_poll_function?

Hi Gerald,

Thanks for your detail information.

Unfortunately I don't have such platform that could launch more than
10 guests in hand.

Here is another patch (see attached file) that fix another potential
race.

Do you have bandwidth to have a try? Thanks in advance!

Best Regards,
-- Dongxiao
Gerald Turner
2010-Sep-13 04:38 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:> Thanks for your detail information. > > Unfortunately I don''t have such platform that could launch more than > 10 guests in hand. > > Here is another patch (see attached file) that fix another potential > race. > > Do you have bandwidth to have a try? Thanks in advance! >I built a kernel with your additional patch. I have it running on all 13 domU''s with use_smartpoll=1. I''ll report tomorrow morning whether there were any lockups. FYI, total today I had 6 lockups with use_smartpoll=1 and the previous patch. -- Gerald Turner Email: gturner@unzane.com JID: gturner@jabber.unzane.com GPG: 0xFA8CD6D5 21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Gerald Turner
2010-Sep-13 16:01 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner <gturner@unzane.com> writes:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Thanks for your detail information.
>>
>> Unfortunately I don't have such platform that could launch more than
>> 10 guests in hand.
>>
>> Here is another patch (see attached file) that fix another potential
>> race.
>>
>> Do you have bandwidth to have a try? Thanks in advance!
>>
>
> I built a kernel with your additional patch.
>
> I have it running on all 13 domU's with use_smartpoll=1.
>
> I'll report tomorrow morning whether there were any lockups.
>
> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> patch.
>

Sorry bad news again...

Had 5 lockups within 4 hours.

Then I restarted all domUs with use_smartpoll=0 and haven't had any
lockups in 7 hours.

--
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5
Pasi Kärkkäinen
2010-Sep-13 16:08 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
> > "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> >
> >> Thanks for your detail information.
> >>
> >> Unfortunately I don't have such platform that could launch more than
> >> 10 guests in hand.
> >>
> >> Here is another patch (see attached file) that fix another potential
> >> race.
> >>
> >> Do you have bandwidth to have a try? Thanks in advance!
> >>
> >
> > I built a kernel with your additional patch.
> >
> > I have it running on all 13 domU's with use_smartpoll=1.
> >
> > I'll report tomorrow morning whether there were any lockups.
> >
> > FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> > patch.
> >
>
> Sorry bad news again...
>
> Had 5 lockups within 4 hours.
>
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.
>

I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
until this is sorted out..

-- Pasi
Jeremy Fitzhardinge
2010-Sep-13 19:36 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>> Gerald Turner <gturner@unzane.com> writes:
>>
>>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>>
>>>> Thanks for your detail information.
>>>>
>>>> Unfortunately I don't have such platform that could launch more than
>>>> 10 guests in hand.
>>>>
>>>> Here is another patch (see attached file) that fix another potential
>>>> race.
>>>>
>>>> Do you have bandwidth to have a try? Thanks in advance!
>>>>
>>> I built a kernel with your additional patch.
>>>
>>> I have it running on all 13 domU's with use_smartpoll=1.
>>>
>>> I'll report tomorrow morning whether there were any lockups.
>>>
>>> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
>>> patch.
>>>
>> Sorry bad news again...
>>
>> Had 5 lockups within 4 hours.
>>
>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>> lockups in 7 hours.
>>
> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> until this is sorted out..

Agreed.

J
Xu, Dongxiao
2010-Sep-14 00:26 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>
>>> Thanks for your detail information.
>>>
>>> Unfortunately I don't have such platform that could launch more
>>> than 10 guests in hand.
>>>
>>> Here is another patch (see attached file) that fix another
>>> potential race.
>>>
>>> Do you have bandwidth to have a try? Thanks in advance!
>>>
>>
>> I built a kernel with your additional patch.
>>
>> I have it running on all 13 domU's with use_smartpoll=1.
>>
>> I'll report tomorrow morning whether there were any lockups.
>>
>> FYI, total today I had 6 lockups with use_smartpoll=1 and the
>> previous patch.
>>
>
> Sorry bad news again...
>
> Had 5 lockups within 4 hours.
>
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.

Thanks Gerald. I will try to find a local environment to do more
investigation.

Best Regards,
-- Dongxiao
Ian Campbell
2010-Sep-14 08:25 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >> lockups in 7 hours.
> >>
> > I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > until this is sorted out..
> Agreed.

Should we also consider adding a netback option to disable it for the
system as a whole as well? Or are the issues strictly in-guest only?

Perhaps netback should support a xenstore key to allow a toolstack to
configure this property per guest?

Ian.
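To illustrate the idea: the toolstack could write a per-vif flag next to the other backend keys and netback would honour it when negotiating features with the frontend. The key name below is hypothetical, not something netback implements today, and <domid> stands for the guest's domain id:

  # dom0: hypothetical per-guest override for vif 0 of domain <domid>
  $ xenstore-write /local/domain/0/backend/vif/<domid>/0/feature-smart-poll 0
  $ xenstore-read /local/domain/0/backend/vif/<domid>/0/feature-smart-poll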
Jeremy Fitzhardinge
2010-Sep-14 17:54 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/14/2010 01:25 AM, Ian Campbell wrote:
> On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
>> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
>>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>>>> lockups in 7 hours.
>>>>
>>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
>>> until this is sorted out..
>> Agreed.
> Should we also consider adding a netback option to disable it for the
> system as a whole as well? Or are the issues strictly in-guest only?
>
> Perhaps netback should support a xenstore key to allow a toolstack to
> configure this property per guest?

It depends on what the problem is. If there's a basic problem with the
smartpoll front<->back communication protocol then we'll probably have
to revert the whole thing and start over. If the bug is just something
in the frontend then we can disable it there until resolved.

Fortunately I haven't pushed netfront smartpoll support upstream yet, so
the userbase is still fairly limited. I hope.

J
Pasi Kärkkäinen
2010-Sep-14 18:44 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >>>> lockups in 7 hours.
> >>>>
> >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> >>> until this is sorted out..
> >> Agreed.
> > Should we also consider adding a netback option to disable it for the
> > system as a whole as well? Or are the issues strictly in-guest only?
> >
> > Perhaps netback should support a xenstore key to allow a toolstack to
> > configure this property per guest?
>
> It depends on what the problem is. If there's a basic problem with the
> smartpoll front<->back communication protocol then we'll probably have
> to revert the whole thing and start over. If the bug is just something
> in the frontend then we can disable it there until resolved.
>
> Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> the userbase is still fairly limited. I hope.
>

There have been quite a few people on ##xen on IRC complaining about it..

I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

-- Pasi
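For a distribution kernel carrying the smartpoll patches, picking up that revert (or reverting the default flip locally) is a one-liner once the commit is identified; the commit id below is a placeholder, not a real hash:

  # in a kernel tree that carries the netfront smartpoll patches
  $ git log --oneline -- drivers/net/xen-netfront.c | grep -i smartpoll
  $ git revert <commit-that-defaulted-smartpoll-to-on>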
Ian Campbell
2010-Sep-15 09:46 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Tue, 2010-09-14 at 19:44 +0100, Pasi Kärkkäinen wrote:
> On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> > On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> > >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> > >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> > >>>> lockups in 7 hours.
> > >>>>
> > >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > >>> until this is sorted out..
> > >> Agreed.
> > > Should we also consider adding a netback option to disable it for the
> > > system as a whole as well? Or are the issues strictly in-guest only?
> > >
> > > Perhaps netback should support a xenstore key to allow a toolstack to
> > > configure this property per guest?
> >
> > It depends on what the problem is. If there's a basic problem with the
> > smartpoll front<->back communication protocol then we'll probably have
> > to revert the whole thing and start over. If the bug is just something
> > in the frontend then we can disable it there until resolved.
> >
> > Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> > the userbase is still fairly limited. I hope.
> >
>
> There have been quite a few people on ##xen on IRC complaining about it..
>
> I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
> Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

I've suggested it on debian-kernel.

Ian.