Christophe Saout
2010-Aug-22 16:43 UTC
[Xen-devel] new netfront and occasional receive path lockup
Hi,

I've been playing with some of the new pvops code, namely DomU guest code. What I've been observing on one of the virtual machines is that the network (vif) is dying after about ten to sixty minutes of uptime. The unfortunate thing here is that I can only reproduce it on a production VM and have been unlucky so far to trigger the bug on a test machine. While this has not been tragic - rebooting fixed the issue - unfortunately I can't spend very much time on debugging after the issue pops up.

Now, what is happening is that the receive path goes dead. The DomU can send packets to Dom0 and those are visible using tcpdump on the Dom0 on the virtual interface, but not the other way around.

Now, I have done more than one change at a time (I'd like to avoid going into pinning it down since I can only reproduce it on a production machine, as I said, so suggestions are welcome), but my suspicion is that it might have to do with the new "smart polling" feature in xen/netfront. Note that I have also updated Dom0 to pull in the latest dom0/backend and netback changes, just to make sure it's not due to an issue that has been fixed there, but I'm still seeing the same.

The production machine is a machine that doesn't have much network load, but deals with a lot of small network requests (DNS and smtp mostly) - a workload which is hard to reproduce on the test machine. Heavy network load (NFS, FTP and so on) for days hasn't triggered the problem. Also, segmentation offloading and similar settings don't have any effect.

The machine has 2 physical and the VM 2 virtual CPUs; DomU has PREEMPT enabled.

I've been looking at the code to see if there might be a race condition somewhere - something like a situation where the hrtimer doesn't run and Dom0 believes the DomU should be polling and doesn't emit an interrupt - but I'm afraid I don't know enough to judge this (I mean, there are spinlocks which look safe to me).

Do you have any suggestions what to try? I can trigger the issue on the production VM again, but debugging should not take more than a few minutes if it happens. Access is only possible via the console. Neither Dom0 nor the guest show anything unusual in the kernel messages, and both continue to behave normally after the network goes dead (I am also able to shut down the guest normally).

Thanks,
Christophe
Christophe Saout
2010-Aug-22 18:37 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi again,

> I've been looking at the code to see if there might be a race condition
> somewhere - something like a situation where the hrtimer doesn't run
> and Dom0 believes the DomU should be polling and doesn't emit an
> interrupt - but I'm afraid I don't know enough to judge this (I mean,
> there are spinlocks which look safe to me).

Hmm, looking a bit more.

rx.sring->private.netif.smartpoll_active lies in a piece of memory that is shared between netback and netfront, is that right?

If that is so, the tx spinlock in netfront only protects against simultaneous modifications from another thread in netfront, so netback can read smartpoll_active while netfront is fiddling with it. Is that safe?

Note that when the lockup occurs, /proc/interrupts in the guest doesn't show any interrupts arriving for eth0 anymore. Are there any conditions where netback waits for netfront to retrieve packets even when new packets arrive? (like e.g. when the ring is full and there is backlog into the network stack or something?) Any way to debug this from the Dom0 side? Like looking into the state of the ring from userspace? Debug options?

Christophe
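For readers following along, the concern can be pictured with a minimal sketch. The field names follow the discussion above, but the struct layout, types and lock usage are illustrative simplifications, not the real driver code:

    /* Sketch only: simplified from the description in this thread. */
    #include <linux/spinlock.h>
    #include <linux/types.h>

    struct rx_sring_sketch {
            struct {
                    struct {
                            uint8_t smartpoll_active;  /* lives in the shared page */
                    } netif;
            } private;
            /* ... ring entries, producer/consumer indices ... */
    };

    struct netfront_sketch {
            spinlock_t tx_lock;                /* domain-local lock */
            struct rx_sring_sketch *rx_sring;  /* page mapped by both domains */
    };

    /* Holding tx_lock serialises netfront's own CPUs, but netback runs in
     * another domain and never takes this lock, so it can observe
     * rx_sring->private.netif.smartpoll_active at any point during an
     * update unless the accesses to the shared page themselves are ordered. */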
Christophe Saout
2010-Aug-23 14:26 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi yet again,

[not quoting everything again]

I finally managed to trigger the issue on the test VM, which has been stuck in that state since last night and can be inspected. Apparently the tx ring on the netback side is full, since every packet sent is immediately dropped (as seen from ifconfig output). No interrupts moving on the guest.

Still I'm wondering what would be the best course of action trying to debug this now. Should I have compiled some debugger into the hypervisor? (gdbsx apparently needs that)

Thanks,
Christophe
Konrad Rzeszutek Wilk
2010-Aug-23 16:04 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On Mon, Aug 23, 2010 at 04:26:52PM +0200, Christophe Saout wrote:
> Hi yet again,
>
> [not quoting everything again]
>
> I finally managed to trigger the issue on the test VM, which has been
> stuck in that state since last night and can be inspected. Apparently
> the tx ring on the netback side is full, since every packet sent is
> immediately dropped (as seen from ifconfig output). No interrupts
> moving on the guest.

What is the kernel and hypervisor in Dom0? And what is it in DomU?

> Still I'm wondering what would be the best course of action trying to
> debug this now. Should I have compiled some debugger into the
> hypervisor? (gdbsx apparently needs that)

Sure. An easier path might be to do 'xm debug-keys q', which should trigger the debug irq handler. In DomU that should print out all of the event channel bits, which we can analyze to see if the proper bits are not set (and hence the IRQ handler isn't picking up from the ring buffer).
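A minimal example of the suggested flow might look like the following. Only 'xm debug-keys q' comes from the message above; where exactly the output lands (hypervisor console vs. the guest's own kernel log) is an assumption about the usual behaviour of the debug key:

    # on Dom0
    $ xm debug-keys q        # ask Xen to trigger the 'q' debug key handler
    $ xm dmesg | tail -n 80  # hypervisor console output, if any

    # inside the DomU
    $ dmesg | tail -n 80     # the guest's debug IRQ handler prints its event channel state here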
Christophe Saout
2010-Aug-23 17:09 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
Hi Konrad,

> > I finally managed to trigger the issue on the test VM, which has been
> > stuck in that state since last night and can be inspected. Apparently
> > the tx ring on the netback side is full, since every packet sent is
> > immediately dropped (as seen from ifconfig output). No interrupts
> > moving on the guest.
>
> What is the kernel and hypervisor in Dom0? And what is it in DomU?

The hypervisor is from the Xen 4.0.0 release and the Dom0 is from Jeremy's 2.6.32 stable branch for pvops Dom0 (and lately with the xen/dom0/backend branches merged in on top, because I hoped there might be some fixes that help). The same kernel has been working fine as a guest, but my newer one, where I took an upstream 2.6.35, applied some of the upstream fixes branches and also pulled xen/netfront in, is now causing this issue. Everything else is working just fine, so I am pretty sure it is related to a netfront-specific change and not to anything else.

> > hypervisor? (gdbsx apparently needs that)
>
> Sure.

Also, I noticed that "gdb /path/to/vmlinux /proc/kcore" does allow me to inspect the memory. I'll try to see if I can pinpoint some of the interesting memory locations.

> An easier path might be to do 'xm debug-keys q', which should
> trigger the debug irq handler. In DomU that should print out all of the
> event channel bits, which we can analyze to see if the
> proper bits are not set (and hence the IRQ handler isn't picking up
> from the ring buffer).

I'm not exactly sure how to read the output of that.

http://www.saout.de/assets/xm-debug-q.txt

Christophe
Jeremy Fitzhardinge
2010-Aug-24 00:46 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 08/22/2010 09:43 AM, Christophe Saout wrote:
> Hi,
>
> I've been playing with some of the new pvops code, namely DomU guest
> code. What I've been observing on one of the virtual machines is that
> the network (vif) is dying after about ten to sixty minutes of uptime.
> The unfortunate thing here is that I can only reproduce it on a
> production VM and have been unlucky so far to trigger the bug on a test
> machine. While this has not been tragic - rebooting fixed the issue -
> unfortunately I can't spend very much time on debugging after the issue
> pops up.

Ah, OK. I've seen this a couple of times as well. And it just happened to me then...

> Now, what is happening is that the receive path goes dead. The DomU can
> send packets to Dom0 and those are visible using tcpdump on the Dom0 on
> the virtual interface, but not the other way around.

I hadn't got to that level of diagnosis, but I can confirm that that's what seems to be happening here too.

> Now, I have done more than one change at a time (I'd like to avoid going
> into pinning it down since I can only reproduce it on a production
> machine, as I said, so suggestions are welcome), but my suspicion is
> that it might have to do with the new "smart polling" feature in
> xen/netfront. Note that I have also updated Dom0 to pull in the latest
> dom0/backend and netback changes, just to make sure it's not due to an
> issue that has been fixed there, but I'm still seeing the same.

I agree. I think I started seeing this once I merged smartpoll into netfront.

J

[...]
Jeremy Fitzhardinge
2010-Aug-24 00:53 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 08/22/2010 11:37 AM, Christophe Saout wrote:
> Hmm, looking a bit more.
>
> rx.sring->private.netif.smartpoll_active lies in a piece of memory that
> is shared between netback and netfront, is that right?
>
> If that is so, the tx spinlock in netfront only protects against
> simultaneous modifications from another thread in netfront, so netback
> can read smartpoll_active while netfront is fiddling with it. Is that
> safe?

It depends on exactly how it is used. But any use of cross-cpu shared memory must carefully consider access ordering, and possibly have explicit barriers to make sure that the expected ordering is actually seen by all cpus.

J

[...]
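To make the ordering point concrete, here is a generic publish/consume sketch for a flag kept in shared memory. The function names and the exact barrier choice are assumptions made for illustration; this is not the netfront/netback code:

    /* Sketch only: assumes the usual kernel headers providing wmb()/rmb(). */

    /* writer side (e.g. the frontend) */
    static void publish_flag(volatile uint8_t *flag, uint8_t value)
    {
            /* make earlier ring updates visible before the flag changes */
            wmb();
            *flag = value;
    }

    /* reader side (e.g. the backend) */
    static uint8_t read_flag(const volatile uint8_t *flag)
    {
            uint8_t value = *flag;
            /* do not act on ring contents read speculatively before the flag */
            rmb();
            return value;
    }

    /* Without such pairing, the two sides may disagree about whether the
     * flag or the data it guards became visible first. */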
Xu, Dongxiao
2010-Aug-25 00:51 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Hi Christophe,

Thanks for finding and checking the problem. I will try to reproduce the issue and check what caused the problem.

Thanks,
Dongxiao

Jeremy Fitzhardinge wrote:
> On 08/22/2010 09:43 AM, Christophe Saout wrote:
[...]
Pasi Kärkkäinen
2010-Sep-09 18:50 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
> Hi Christophe,
>
> Thanks for finding and checking the problem.
> I will try to reproduce the issue and check what caused the problem.
>

Hello,

Was this issue resolved? Some users have been complaining about "network freezing up" issues recently on ##xen on irc..

-- Pasi

> Thanks,
> Dongxiao
>
> Jeremy Fitzhardinge wrote:
[...]
Jeremy Fitzhardinge
2010-Sep-10 00:55 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
> On Wed, Aug 25, 2010 at 08:51:09AM +0800, Xu, Dongxiao wrote:
>> Hi Christophe,
>>
>> Thanks for finding and checking the problem.
>> I will try to reproduce the issue and check what caused the problem.
>>
> Hello,
>
> Was this issue resolved? Some users have been complaining about
> "network freezing up" issues recently on ##xen on irc..

Yeah, I'll add a command-line parameter to disable smartpoll (and leave it off by default).

J

[...]
Xu, Dongxiao
2010-Sep-10 01:45 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Hi Jeremy and Pasi,

I was frustrated that I couldn't reproduce this bug at my site. However, I investigated the code, and indeed there is one race condition that probably causes the bug. See the attached patch.

Could anybody who can see this bug help to try it? Appreciate it much!

Thanks,
Dongxiao

Jeremy Fitzhardinge wrote:
> On 09/10/2010 04:50 AM, Pasi Kärkkäinen wrote:
[...]
Jeremy Fitzhardinge
2010-Sep-10 02:25 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.

Perhaps you have been trying to reproduce it in the wrong conditions? I have generally seen this bug when the networking is under very light load, such as a couple of fairly idle dom0<->domU ssh connections. I'm not sure that I've seen it under heavy load.

> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug. See the attached patch.
>
> Could anybody who can see this bug help to try it? Appreciate it much!

Thanks for looking into this. Your logic seems reasonable, so I'll apply it (however I also added a patch to make smartpoll default to "off"; I guess I can switch that to default on again to make sure it gets tested, but leave the option as a workaround if there are still problems).

However, I am concerned about these manipulations of a cross-cpu shared variable without any barriers or other ordering constraints. Are you sure this code is correct under any reordering (either by the compiler or CPUs), and if the compiler decides to access it more or less often than the source says it should?

Thanks,
J

[...]
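A small sketch of the compiler side of that concern (the function and parameter names are made up for illustration; ACCESS_ONCE() is the stock kernel helper from <linux/compiler.h>):

    #include <linux/compiler.h>
    #include <linux/types.h>

    /* Illustrative only: forces exactly one load of a location in shared
     * memory, so the compiler cannot re-read it later or reuse a value
     * cached from before the surrounding logic ran. */
    static int frontend_should_poll(const uint8_t *smartpoll_active)
    {
            uint8_t active = ACCESS_ONCE(*smartpoll_active);

            return active != 0;
    }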
Xu, Dongxiao
2010-Sep-10 02:37 UTC
RE: [Xen-devel] new netfront and occasional receive path lockup
Jeremy Fitzhardinge wrote:
> On 09/10/2010 11:45 AM, Xu, Dongxiao wrote:
>> Hi Jeremy and Pasi,
>>
>> I was frustrated that I couldn't reproduce this bug at my site.
>
> Perhaps you have been trying to reproduce it in the wrong conditions?
> I have generally seen this bug when the networking is under very
> light load, such as a couple of fairly idle dom0<->domU ssh
> connections. I'm not sure that I've seen it under heavy load.
>
>> However, I investigated the code, and indeed there is one race condition
>> that probably causes the bug. See the attached patch.
>>
>> Could anybody who can see this bug help to try it? Appreciate it much!
>
> Thanks for looking into this. Your logic seems reasonable, so I'll
> apply it (however I also added a patch to make smartpoll default to
> "off"; I guess I can switch that to default on again to make sure it
> gets tested, but leave the option as a workaround if there are still
> problems).
>
> However, I am concerned about these manipulations of a cross-cpu
> shared variable without any barriers or other ordering constraints.
> Are you sure this code is correct under any reordering (either by the
> compiler or CPUs), and if the compiler decides to access it more or
> less often than the source says it should?

Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"? It is a flag in the shared ring structure, therefore operations on this flag are handled the same as the other components of the shared ring, such as being done under the spinlock, etc.

I will keep dom0 and domU ssh'ed for some time to see if the bug still exists.

Thanks,
Dongxiao

[...]
Jeremy Fitzhardinge
2010-Sep-10 02:42 UTC
Re: [Xen-devel] new netfront and occasional receive path lockup
On 09/10/2010 12:37 PM, Xu, Dongxiao wrote:
>> However, I am concerned about these manipulations of a cross-cpu
>> shared variable without any barriers or other ordering constraints.
>> Are you sure this code is correct under any reordering (either by the
>> compiler or CPUs), and if the compiler decides to access it more or
>> less often than the source says it should?
> Do you mean the flag "np->rx.sring->private.netif.smartpoll_active"?
> It is a flag in the shared ring structure, therefore operations on
> this flag are handled the same as the other components of the shared
> ring, such as being done under the spinlock, etc.

Spinlocks are no use for inter-domain synchronization, only within a domain. The other ring operations are carefully ordered with appropriate memory barriers in specific places; that's why I'm a bit concerned about their absence for the smartpoll_active flag. Even if they are not necessary, I'd like to see an analysis as to why.

J
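The ordering referred to here is the pattern the shared-ring macros use. Roughly, paraphrased from memory rather than copied verbatim from xen/interface/io/ring.h, the producer-index update looks like the sketch below, and it is exactly this kind of pairing that the smartpoll_active flag lacks:

    /* Paraphrased sketch of the shared-ring "push" ordering. */
    #define PUSH_RESPONSES_SKETCH(_r) do {                              \
            wmb();  /* entries visible before the producer index moves */ \
            (_r)->sring->rsp_prod = (_r)->rsp_prod_pvt;                 \
    } while (0)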
Gerald Turner
2010-Sep-12 01:00 UTC
[Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> Hi Jeremy and Pasi,
>
> I was frustrated that I couldn't reproduce this bug at my site.
> However, I investigated the code, and indeed there is one race condition
> that probably causes the bug. See the attached patch.
>
> Could anybody who can see this bug help to try it? Appreciate it much!
>

Hello, I experienced this problem with netfront and the smartpoll code causing the domU bridge interfaces to fail.

I've been building a Xen server using Debian Squeeze, Xen 4.0.1-rc6. For weeks the server had been running solid with just three domUs. In the last few days I significantly increased the number of domUs (13 total) and have been having terrible packet drop problems.

Randomly, maybe after 10 to 60 minutes of uptime, a domU or two will fall victim to bridge failure. There's no syslog/dmesg output. The only report of the problem can be seen through network stats on dom0 (the domU vifX.X interfaces have huge TX drops), and 'brctl showmacs' output is missing the MAC addresses for the domUs that have failed.

I'm not doing anything interesting with networking: eth0/peth0 on dom0 with static IP, vifX.0 on domU, no DHCP, no firewall rules (other than fail2ban), static IP assigned within each domU. I'm using PV, and dom0 and all domUs run the Debian -xen-amd64 flavor kernel (no interest in HVM).

I've tried dozens of attempts to solve this:

* Screwed with ethtool -K XXX tx off on dom0, domU, physical interface.
* Removed the 'network-bridge' setup from xend and set up 'br0' the Debian Way.
* Commented out 'iptables_setup' from the 'vif-bridge' script, which was producing lots of iptables noise.
* Used 'mac=' in the domU vif config.
* Tried the latest vanilla 2.6.35.5 kernel (its netfront driver is pre-smartpoll) - I didn't give this kernel enough time to break, I saw TX drops on boot and assumed the problem was still there, but my judgement was incorrect - all domUs get a few TX drops while the kernel boots (probably ARPs while vifX.X is up but before the domU ifups its eth0 on boot).

Friday morning a fellow named 'Nrg_' on ##xen immediately diagnosed this as possibly being related to the smartpoll bug in the netfront driver.

I examined the Debian linux-image-2.6.32-5-xen-amd64 package and confirmed the netfront driver is patched with an earlier version of the smartpoll code.

I manually merged Debian's kernel with Jeremy's updates to the netfront driver in his git repository.

$ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606

Deployed this new image on all domUs (except for two of them, as a control group) and updated the grub kernel parameter with xen_netfront.use_smartpoll=0.

Problem solved. Only the two domUs I left unpatched get victimized. The rest of the hosts have been up for over a day and have not lost any packets.

P.S. this is my first NNTP post thru gmane, I have no idea if it will reach the list, keep Message-Id/References intact, and CC Christophe, Jeremy, Dongxiao et al.

> Jeremy Fitzhardinge wrote:
[...]

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
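For anyone wanting to replicate the workaround described above, a Debian-style way of setting that parameter might look like the following. Only the xen_netfront.use_smartpoll=0 parameter itself comes from the report; the file path and the update-grub step are assumptions about a typical grub2 setup:

    # in /etc/default/grub on the domU (illustrative)
    GRUB_CMDLINE_LINUX="xen_netfront.use_smartpoll=0"

    # then regenerate the grub config and reboot the guest
    $ update-grub
    $ reboot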
Jeremy Fitzhardinge
2010-Sep-12 08:55 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/12/2010 11:00 AM, Gerald Turner wrote:
> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> confirmed the netfront driver is patched with an earlier version of the
> smartpoll code.
>
> I manually merged Debian's kernel with Jeremy's updates to the netfront
> driver in his git repository.
>
> $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>
> Deployed this new image on all domUs (except for two of them, as a
> control group) and updated the grub kernel parameter with
> xen_netfront.use_smartpoll=0.

That's good to hear. But I also included a fix from Dongxiao which, if correct, means it should work with use_smartpoll=1 (or nothing, as that's the default). Could you verify whether the fix in cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

> Problem solved. Only the two domUs I left unpatched get victimized.
> The rest of the hosts have been up for over a day and have not lost any
> packets.
>
> P.S. this is my first NNTP post thru gmane, I have no idea if it will
> reach the list, keep Message-Id/References intact, and CC Christophe,
> Jeremy, Dongxiao et al.

There were no cc:s.

Thanks,
J

[...]
Pasi Kärkkäinen
2010-Sep-12 17:23 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Sun, Sep 12, 2010 at 06:55:48PM +1000, Jeremy Fitzhardinge wrote:
> On 09/12/2010 11:00 AM, Gerald Turner wrote:
> > I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
> > confirmed the netfront driver is patched with an earlier version of
> > the smartpoll code.
> >
> > I manually merged Debian's kernel with Jeremy's updates to the netfront
> > driver in his git repository.
> >
> > $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
> >
> > Deployed this new image on all domUs (except for two of them, as a
> > control group) and updated the grub kernel parameter with
> > xen_netfront.use_smartpoll=0.
>
> That's good to hear. But I also included a fix from Dongxiao which, if
> correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default). Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>

It'd be good to get the fix(es) into xen/stable-2.6.32.x as well.. Or can you use "use_smartpoll=0" in the current xen/stable-2.6.32.x branch?

-- Pasi
Gerald Turner
2010-Sep-12 22:40 UTC
[Xen-devel] Re: new netfront and occasional receive path lockup
Jeremy Fitzhardinge <jeremy@goop.org> writes:
> On 09/12/2010 11:00 AM, Gerald Turner wrote:
>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>> confirmed the netfront driver is patched with an earlier version of
>> the smartpoll code.
>>
>> I manually merged Debian's kernel with Jeremy's updates to the
>> netfront driver in his git repository.
>>
>> $ git diff 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c8475f0c00e0606
>>
>> Deployed this new image on all domUs (except for two of them, as a
>> control group) and updated the grub kernel parameter with
>> xen_netfront.use_smartpoll=0.
>
> That's good to hear. But I also included a fix from Dongxiao which,
> if correct, means it should work with use_smartpoll=1 (or nothing, as
> that's the default). Could you verify whether the fix in
> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?

I've been running with use_smartpoll=1 for a few hours this afternoon; it looks like Dongxiao's bugfix works.

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
Gerald Turner
2010-Sep-13 00:03 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner <gturner@unzane.com> writes:
> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>
>> On 09/12/2010 11:00 AM, Gerald Turner wrote:
[...]
>> That's good to hear. But I also included a fix from Dongxiao which,
>> if correct, means it should work with use_smartpoll=1 (or nothing, as
>> that's the default). Could you verify whether the fix in
>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>
> I've been running with use_smartpoll=1 for a few hours this afternoon;
> it looks like Dongxiao's bugfix works.

I spoke too soon! With use_smartpoll set to 1 the problem still shows up: a few domUs lost network after about 60 minutes of uptime. Sorry for the bad news...

-- 
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5
Xu, Dongxiao
2010-Sep-13 00:54 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
>> Jeremy Fitzhardinge <jeremy@goop.org> writes:
>>
>>> On 09/12/2010 11:00 AM, Gerald Turner wrote:
>>>> I examined the Debian linux-image-2.6.32-5-xen-amd64 package and
>>>> confirmed the netfront driver is patched with an earlier version
>>>> the smartpoll code.
>>>>
>>>> I manually merged Debian's kernel with Jeremy's updates to the
>>>> netfront driver in his git repository.
>>>>
>>>> $ git diff
>>>> 5473680bdedb7a62e641970119e6e9381a8d80f4..3b966565a89659f938a4fd662c
>>>> 8475f0c00e0606
>>>>
>>>> Deployed this new image on all domU's (except for two of them, as a
>>>> control group) and updated grub kernel parameter with
>>>> xen_netfront.use_smartpoll=0.
>>>
>>> That's good to hear. But I also included a fix from Dongxiao which,
>>> if correct, means it should work with use_smartpoll=1 (or nothing,
>>> as that's the default). Could you verify whether the fix in
>>> cb09635065163a933d0d00d077ddd9f0c0a908a1 does actually work or not?
>>>
>>
>> I've been running with use_smartpoll=1 for a few hours this
>> afternoon, looks like Dongxiao's bugfix works.
>>
>
> I spoke too soon! use_smartpoll set to 1 and still exhibits the
> problem, a few domU's lost network after about 60 minutes of uptime.
> Sorry for the bad news...

Hi Gerald,

Sorry for the inconvenience. I will continue to look into it.

Does this bug only happen when you launch multiple domUs? I tried a
single domU and could not catch the bug.

Thanks,
Dongxiao
Gerald Turner
2010-Sep-13 02:12 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:> Does this bug only happen when you launch multiple domUs? I tried a > single domU and could not catch the bug. >I''ve been working on this server for about two weeks, I hadn''t noticed the problem for the first week when I only had 3 domUs. It started happening when I added 10 more domUs. The problem would happen quickly, within 10 minutes, always affecting at least two domUs at random, and affect more domUs over time. Saturday I installed the updated driver with Jeremy''s use_smartpoll parameter, ran for 24 hours with smartpoll disabled, no problems. Today I''ve been trying with smartpoll enabled. It took an hour to affect two domUs - noticibly longer than the behavior previous days before installing your patch. I still have 9 other domUs running with smartpoll enabled, four hours uptime, I''m surprised they haven''t been affected yet. Could there be another less-frequent race in smart_poll_function? -- Gerald Turner Email: gturner@unzane.com JID: gturner@jabber.unzane.com GPG: 0xFA8CD6D5 21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Xu, Dongxiao
2010-Sep-13 02:34 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Does this bug only happen when you launch multiple domUs? I tried a
>> single domU and could not catch the bug.
>>
>
> I've been working on this server for about two weeks; I hadn't
> noticed the problem for the first week when I only had 3 domUs. It
> started happening when I added 10 more domUs. The problem would
> happen quickly, within 10 minutes, always affecting at least two
> domUs at random, and affecting more domUs over time.
>
> Saturday I installed the updated driver with Jeremy's use_smartpoll
> parameter, ran for 24 hours with smartpoll disabled, no problems.
>
> Today I've been trying with smartpoll enabled. It took an hour to
> affect two domUs - noticeably longer than the behavior on previous
> days, before installing your patch. I still have 9 other domUs running
> with smartpoll enabled, four hours uptime; I'm surprised they haven't
> been affected yet. Could there be another less-frequent race in
> smart_poll_function?

Hi Gerald,

Thanks for your detail information.

Unfortunately I don't have such platform that could launch more than
10 guests in hand.

Here is another patch (see attached file) that fix another potential
race.

Do you have bandwidth to have a try? Thanks in advance!

Best Regards,
-- Dongxiao
Gerald Turner
2010-Sep-13 04:38 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
"Xu, Dongxiao" <dongxiao.xu@intel.com> writes:> Thanks for your detail information. > > Unfortunately I don''t have such platform that could launch more than > 10 guests in hand. > > Here is another patch (see attached file) that fix another potential > race. > > Do you have bandwidth to have a try? Thanks in advance! >I built a kernel with your additional patch. I have it running on all 13 domU''s with use_smartpoll=1. I''ll report tomorrow morning whether there were any lockups. FYI, total today I had 6 lockups with use_smartpoll=1 and the previous patch. -- Gerald Turner Email: gturner@unzane.com JID: gturner@jabber.unzane.com GPG: 0xFA8CD6D5 21D9 B2E8 7FE7 F19E 5F7D 4D0C 3FA0 810F FA8C D6D5 _______________________________________________ Xen-devel mailing list Xen-devel@lists.xensource.com http://lists.xensource.com/xen-devel
Gerald Turner
2010-Sep-13 16:01 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner <gturner@unzane.com> writes:
> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>
>> Thanks for your detail information.
>>
>> Unfortunately I don't have such platform that could launch more than
>> 10 guests in hand.
>>
>> Here is another patch (see attached file) that fix another potential
>> race.
>>
>> Do you have bandwidth to have a try? Thanks in advance!
>>
>
> I built a kernel with your additional patch.
>
> I have it running on all 13 domU's with use_smartpoll=1.
>
> I'll report tomorrow morning whether there were any lockups.
>
> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> patch.
>

Sorry bad news again...

Had 5 lockups within 4 hours.

Then I restarted all domUs with use_smartpoll=0 and haven't had any
lockups in 7 hours.

--
Gerald Turner   Email: gturner@unzane.com   JID: gturner@jabber.unzane.com
GPG: 0xFA8CD6D5  21D9 B2E8 7FE7 F19E 5F7D  4D0C 3FA0 810F FA8C D6D5
Pasi Kärkkäinen
2010-Sep-13 16:08 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
> > "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
> >
> >> Thanks for your detail information.
> >>
> >> Unfortunately I don't have such platform that could launch more than
> >> 10 guests in hand.
> >>
> >> Here is another patch (see attached file) that fix another potential
> >> race.
> >>
> >> Do you have bandwidth to have a try? Thanks in advance!
> >>
> >
> > I built a kernel with your additional patch.
> >
> > I have it running on all 13 domU's with use_smartpoll=1.
> >
> > I'll report tomorrow morning whether there were any lockups.
> >
> > FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
> > patch.
> >
>
> Sorry bad news again...
>
> Had 5 lockups within 4 hours.
>
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.
>

I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
until this is sorted out..

-- Pasi
Jeremy Fitzhardinge
2010-Sep-13 19:36 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>> Gerald Turner <gturner@unzane.com> writes:
>>
>>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>>
>>>> Thanks for your detail information.
>>>>
>>>> Unfortunately I don't have such platform that could launch more than
>>>> 10 guests in hand.
>>>>
>>>> Here is another patch (see attached file) that fix another potential
>>>> race.
>>>>
>>>> Do you have bandwidth to have a try? Thanks in advance!
>>>>
>>> I built a kernel with your additional patch.
>>>
>>> I have it running on all 13 domU's with use_smartpoll=1.
>>>
>>> I'll report tomorrow morning whether there were any lockups.
>>>
>>> FYI, total today I had 6 lockups with use_smartpoll=1 and the previous
>>> patch.
>>>
>> Sorry bad news again...
>>
>> Had 5 lockups within 4 hours.
>>
>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>> lockups in 7 hours.
>>
> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> until this is sorted out..

Agreed.

J
Xu, Dongxiao
2010-Sep-14 00:26 UTC
RE: [Xen-devel] Re: new netfront and occasional receive path lockup
Gerald Turner wrote:
> Gerald Turner <gturner@unzane.com> writes:
>
>> "Xu, Dongxiao" <dongxiao.xu@intel.com> writes:
>>
>>> Thanks for your detail information.
>>>
>>> Unfortunately I don't have such platform that could launch more
>>> than 10 guests in hand.
>>>
>>> Here is another patch (see attached file) that fix another
>>> potential race.
>>>
>>> Do you have bandwidth to have a try? Thanks in advance!
>>>
>>
>> I built a kernel with your additional patch.
>>
>> I have it running on all 13 domU's with use_smartpoll=1.
>>
>> I'll report tomorrow morning whether there were any lockups.
>>
>> FYI, total today I had 6 lockups with use_smartpoll=1 and the
>> previous patch.
>>
>
> Sorry bad news again...
>
> Had 5 lockups within 4 hours.
>
> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> lockups in 7 hours.

Thanks Gerald. I will try to find a local environment to do more
investigation.

Best Regards,
-- Dongxiao
Ian Campbell
2010-Sep-14 08:25 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >> lockups in 7 hours.
> >>
> > I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > until this is sorted out..
> Agreed.

Should we also consider adding a netback option to disable it for the
system as a whole as well? Or are the issues strictly in-guest only?

Perhaps netback should support a xenstore key to allow a toolstack to
configure this property per guest?

Ian.
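To illustrate the idea: the toolstack could write a per-vif flag next to the other backend keys and netback would honour it when negotiating features with the frontend. The key name below is hypothetical, not something netback implements today, and <domid> stands for the guest's domain id:

  # dom0: hypothetical per-guest override for vif 0 of domain <domid>
  $ xenstore-write /local/domain/0/backend/vif/<domid>/0/feature-smart-poll 0
  $ xenstore-read /local/domain/0/backend/vif/<domid>/0/feature-smart-poll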
Jeremy Fitzhardinge
2010-Sep-14 17:54 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On 09/14/2010 01:25 AM, Ian Campbell wrote:
> On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
>> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
>>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
>>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
>>>> lockups in 7 hours.
>>>>
>>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
>>> until this is sorted out..
>> Agreed.
> Should we also consider adding a netback option to disable it for the
> system as a whole as well? Or are the issues strictly in-guest only?
>
> Perhaps netback should support a xenstore key to allow a toolstack to
> configure this property per guest?

It depends on what the problem is. If there's a basic problem with the
smartpoll front<->back communication protocol then we'll probably have
to revert the whole thing and start over. If the bug is just something
in the frontend then we can disable it there until resolved.

Fortunately I haven't pushed netfront smartpoll support upstream yet, so
the userbase is still fairly limited. I hope.

J
Pasi Kärkkäinen
2010-Sep-14 18:44 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> >>>> lockups in 7 hours.
> >>>>
> >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> >>> until this is sorted out..
> >> Agreed.
> > Should we also consider adding a netback option to disable it for the
> > system as a whole as well? Or are the issues strictly in-guest only?
> >
> > Perhaps netback should support a xenstore key to allow a toolstack to
> > configure this property per guest?
>
> It depends on what the problem is. If there's a basic problem with the
> smartpoll front<->back communication protocol then we'll probably have
> to revert the whole thing and start over. If the bug is just something
> in the frontend then we can disable it there until resolved.
>
> Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> the userbase is still fairly limited. I hope.
>

There have been quite a few people on ##xen on IRC complaining about it..

I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

-- Pasi
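For a distribution kernel carrying the smartpoll patches, picking up that revert (or reverting the default flip locally) is a one-liner once the commit is identified; the commit id below is a placeholder, not a real hash:

  # in a kernel tree that carries the netfront smartpoll patches
  $ git log --oneline -- drivers/net/xen-netfront.c | grep -i smartpoll
  $ git revert <commit-that-defaulted-smartpoll-to-on>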
Ian Campbell
2010-Sep-15 09:46 UTC
Re: [Xen-devel] Re: new netfront and occasional receive path lockup
On Tue, 2010-09-14 at 19:44 +0100, Pasi Kärkkäinen wrote:
> On Tue, Sep 14, 2010 at 10:54:27AM -0700, Jeremy Fitzhardinge wrote:
> > On 09/14/2010 01:25 AM, Ian Campbell wrote:
> > > On Mon, 2010-09-13 at 20:36 +0100, Jeremy Fitzhardinge wrote:
> > >> On 09/13/2010 09:08 AM, Pasi Kärkkäinen wrote:
> > >>> On Mon, Sep 13, 2010 at 09:01:57AM -0700, Gerald Turner wrote:
> > >>>> Then I restarted all domUs with use_smartpoll=0 and haven't had any
> > >>>> lockups in 7 hours.
> > >>>>
> > >>> I think we should default xen/stable-2.6.32.x to use_smartpoll=0 for the time being
> > >>> until this is sorted out..
> > >> Agreed.
> > > Should we also consider adding a netback option to disable it for the
> > > system as a whole as well? Or are the issues strictly in-guest only?
> > >
> > > Perhaps netback should support a xenstore key to allow a toolstack to
> > > configure this property per guest?
> >
> > It depends on what the problem is. If there's a basic problem with the
> > smartpoll front<->back communication protocol then we'll probably have
> > to revert the whole thing and start over. If the bug is just something
> > in the frontend then we can disable it there until resolved.
> >
> > Fortunately I haven't pushed netfront smartpoll support upstream yet, so
> > the userbase is still fairly limited. I hope.
> >
>
> There have been quite a few people on ##xen on IRC complaining about it..
>
> I think the smartpoll code has ended up in the Debian Squeeze 2.6.32-5-xen kernel..
> Hopefully they'll pull the "Revert "xen/netfront: default smartpoll to on"" soon..

I've suggested it on debian-kernel.

Ian.