Teck Choon Giam
2009-Jul-04 06:32 UTC
[Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
Hi,

I have experienced network stalls when running xen 3.4.x on all DELL PE850/860/R200 servers, which use the onboard Broadcom NIC with the tg3 driver. I have done some testing on both xen-3.3.2-rc3 and xen-3.4.1-rc5 with linux-2.6.18-xen.hg changeset 913. The test runs, in a domU, scp transfers of a couple of 1MB/10MB/100MB files to another server in a few concurrent instances. Within an hour the network will stall in xen-3.4.1-rc5 but not in xen-3.3.2-rc3. ifconfig, route -n and ip link all show normal output, but I am unable to ping the gateway.

Sometimes the following steps bring back the network, but not always, and then a reboot is needed. (I run them from crontab, using a custom script that pings the gateway and executes the steps below on 100% packet loss.)

1. xm shutdown all domUs
2. service xendomains stop
3. stop network-bridge
4. service xend stop
5. service xend start
6. xm create all domUs

However, the above might leave some domU ext3 file systems dirty, so that e2fsck is required.

I have done many tests (more than 5 times on 3 DELL PE850/860 servers) and the results are the same. With xen-3.3.2-rc3 there is no issue, and the network does not go down or stall during the scp transfer test to the other server. With xen-3.4.1-rc5, the stall happens within an hour if the test is carried out with at least 5 instances running concurrently. In fact, xen-3.4.0 and xen-3.4.1-rc1 to rc5 all behave the same.

/var/log/messages shows the following when the network stalls:

    tg3: peth0: transmit timed out, resetting

I have tried:

    /sbin/ethtool -K eth0 tx off
    /sbin/ethtool -K eth0 rx off
    /sbin/ethtool -K eth0 gso off
    /sbin/ethtool -K eth0 tso off

Are there any netfront/netback changes between xen-3.3.x and xen-3.4.x which could cause such an issue? Has anybody experienced such a network stall with tg3 in a bridged network environment? The same test carried out on non-tg3 servers, such as those with e100/e1000 drivers, does not cause the network stall.

All servers are running CentOS 5.3 with linux-2.6.18.8-xen for all dom0s and domUs.

Any idea?

Thanks.
Kindest regards,
Giam Teck Choon
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel
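The crontab watchdog mentioned in the message was not posted. A minimal sketch of what such a script might look like is given here. The gateway address, the ping count and all function names are assumptions made for illustration. The recovery steps are the six listed in the message, with `/etc/xen/scripts/network-bridge` assumed as the location of the network-bridge script and `service xendomains start` assumed as a stand-in for "xm create all domUs".

```shell
#!/bin/bash
# Sketch of a gateway watchdog. GATEWAY and the function names are
# hypothetical; adjust them to the local setup.
GATEWAY="${GATEWAY:-192.0.2.1}"   # assumed gateway address (documentation range)

# Read ping output on stdin and print the packet loss as an integer,
# taken from the summary line, e.g.
# "5 packets transmitted, 0 received, 100% packet loss, time 4000ms".
packet_loss_pct() {
    awk '/packet loss/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /%$/) { sub(/%/, "", $i); print int($i); exit }
    }'
}

# The six recovery steps from the message. Steps 3 and 6 use assumed
# equivalents (see the paragraph above this block).
recover_network() {
    xm shutdown -a -w                      # 1. shut down all domUs and wait
    service xendomains stop                # 2.
    /etc/xen/scripts/network-bridge stop   # 3. assumed script path
    service xend stop                      # 4.
    service xend start                     # 5.
    service xendomains start               # 6. assumed way to recreate the domUs
}

check_gateway() {
    local loss
    loss=$(ping -c 5 -q "$GATEWAY" 2>/dev/null | packet_loss_pct)
    # Missing output (ping failed outright) is treated as total loss.
    if [ "${loss:-100}" -ge 100 ]; then
        recover_network
    fi
}

# Only act when explicitly asked to, so the file can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    check_gateway
fi
```

A crontab entry would call the script with `--run`. Because step 1 shuts the guests down before xend is restarted, this ordering is the one least likely to leave ext3 file systems dirty, although the message reports that it can still happen.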
Keir Fraser
2009-Jul-04 06:41 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
On 04/07/2009 07:32, "Teck Choon Giam" <giamteckchoon@gmail.com> wrote:

> Are there any netfront/netback changes between xen-3.3.x and xen-3.4.x
> which could cause such an issue? Has anybody experienced such a network
> stall with tg3 in a bridged network environment?

You are running exactly the same kernels on both 3.3 and 3.4? Then there can be no netfront/back differences, as you are running exactly the same drivers.

Certain things are on by default in 3.4 which were not in 3.3. One such thing is MSI. You might try disabling CONFIG_PCI_MSI in your dom0 kernel config. It'll be listed as "Message Signaled Interrupts (MSI and MSI-X)" in the PCI section of the Linux config menus.

 -- Keir
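Before rebuilding, the current state of the option can be read straight from the kernel's `.config`. The helper below is only a sketch and its name is hypothetical; it relies on the standard kconfig convention that an enabled option appears as `CONFIG_X=y` and a disabled one as `# CONFIG_X is not set`.

```shell
# Report whether CONFIG_PCI_MSI is enabled in a kernel .config file.
# Prints "enabled", "disabled" or "absent". The function name is hypothetical.
msi_state() {
    local config="$1"
    if grep -q '^CONFIG_PCI_MSI=y' "$config"; then
        echo enabled
    elif grep -q '^# CONFIG_PCI_MSI is not set' "$config"; then
        echo disabled
    else
        echo absent
    fi
}
```

Run as `msi_state .config` from the dom0 kernel build directory, once for each build being compared.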
Teck Choon Giam
2009-Jul-04 06:59 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
Hi Keir,

On Sat, Jul 4, 2009 at 2:41 PM, Keir Fraser<keir.fraser@eu.citrix.com> wrote:
> On 04/07/2009 07:32, "Teck Choon Giam" <giamteckchoon@gmail.com> wrote:
>
>> Are there any netfront/netback changes between xen-3.3.x and xen-3.4.x
>> which could cause such an issue? Has anybody experienced such a network
>> stall with tg3 in a bridged network environment?
>
> You are running exactly the same kernels on both 3.3 and 3.4? Then there can
> be no netfront/back differences, as you are running exactly the same drivers.

Yes, it is the same kernel version for both, since I compiled both xen versions using the same linux-2.6.18-xen.hg.

> Certain things are on by default in 3.4 which were not in 3.3. One such
> thing is MSI. You might try disabling CONFIG_PCI_MSI in your dom0 kernel
> config. It'll be listed as "Message Signaled Interrupts (MSI and MSI-X)" in
> the PCI section of the Linux config menus.

I checked both configs and both show:

    # CONFIG_PCI_MSI is not set

I also did a diff of both .config files and they are the same.

Thanks.

Kindest regards,
Giam Teck Choon
Keir Fraser
2009-Jul-04 07:09 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
On 04/07/2009 07:59, "Teck Choon Giam" <giamteckchoon@gmail.com> wrote:

>> Certain things are on by default in 3.4 which were not in 3.3. One such
>> thing is MSI. You might try disabling CONFIG_PCI_MSI in your dom0 kernel
>> config. It'll be listed as "Message Signaled Interrupts (MSI and MSI-X)" in
>> the PCI section of the Linux config menus.
>
> I checked both configs and both show:
> # CONFIG_PCI_MSI is not set
>
> I also did a diff of both .config files and they are the same.

Power management is another difference between 3.3 and 3.4. You can disable 3.4 power management by adding Xen boot parameters: cpuidle=0 cpufreq=none

 -- Keir
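On a CentOS 5 dom0 these parameters go on the hypervisor line (the `kernel ... xen.gz` line) of the GRUB legacy configuration. The filter below is a hedged sketch of one way to add them; the function name is hypothetical, and it assumes the hypervisor image is called `xen.gz`, as it is later in this thread.

```shell
# Append "cpuidle=0 cpufreq=none" to the Xen hypervisor line of a GRUB
# legacy config read on stdin. Lines that already carry a cpuidle=
# setting, and all other lines, pass through unchanged.
# The function name is hypothetical.
add_pm_off() {
    sed '/^[[:space:]]*kernel[[:space:]].*xen\.gz/ {
        /cpuidle=/!s/[[:space:]]*$/ cpuidle=0 cpufreq=none/
    }'
}
```

Typical use would be `add_pm_off < /boot/grub/grub.conf > grub.conf.new`, followed by a manual review of the result before it replaces the live file. The parameters take effect only after a reboot of the host.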
Teck Choon Giam
2009-Jul-04 07:30 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
Hi Keir,

> Power management is another difference between 3.3 and 3.4. You can disable
> 3.4 power management by adding Xen boot parameters: cpuidle=0 cpufreq=none

In xen-3.3.x I get:

    # xenpm get-cpuidle-states
    Xen cpuidle is not enabled!
    Xen cpufreq is not enabled!

In xen-3.4.x I get:

    # xenpm get-cpuidle-states
    Max C-state: C1

    cpu id : 0
    total C-states : 2
    idle time(ms) : 131588676
    C0 : transition [00000000000019346170] residency [00000000000003897999 ms]
    C1 : transition [00000000000019346170] residency [00000000000131507268 ms]

    cpu id : 1
    total C-states : 2
    idle time(ms) : 131696919
    C0 : transition [00000000000012247741] residency [00000000000003766854 ms]
    C1 : transition [00000000000012247741] residency [00000000000131638414 ms]

    cpu id : 2
    total C-states : 2
    idle time(ms) : 131540647
    C0 : transition [00000000000013405442] residency [00000000000003922680 ms]
    C1 : transition [00000000000013405442] residency [00000000000131482588 ms]

    cpu id : 3
    total C-states : 2
    idle time(ms) : 131527968
    C0 : transition [00000000000031194790] residency [00000000000004030618 ms]
    C1 : transition [00000000000031194790] residency [00000000000131374650 ms]

I will disable power management and run the test tomorrow to see whether the network stall issue is still there.

Thanks.

Kindest regards,
Giam Teck Choon
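The residency counters above are easier to compare across CPUs as percentages. The following is a rough sketch, not part of xenpm: the function name and output format are hypothetical, and it assumes the C-states of each CPU are listed in ascending order, as in the output above. It splits the text into whitespace-separated tokens, so it does not depend on how the original lines were wrapped.

```shell
# Summarise `xenpm get-cpuidle-states` output read on stdin: for every CPU,
# print the share of total residency spent in its last listed C-state.
# The function name and output format are hypothetical.
cstate_share() {
    tr -s '[:space:]' '\n' | awk '
        $0 == "id"             { want_cpu = 1; next }
        want_cpu && /^[0-9]+$/ { cpu = $0; want_cpu = 0
                                 order[++ncpu] = cpu; next }
        /^C[0-9]+$/            { state = $0; next }
        $0 == "residency"      { want_res = 1; next }
        want_res {
            sub(/^\[0*/, "")                 # drop "[" and the zero padding
            ms = ($0 == "" ? 0 : $0 + 0)
            total[cpu] += ms
            deepest[cpu] = state; deep_ms[cpu] = ms
            want_res = 0
        }
        END {
            for (i = 1; i <= ncpu; i++) {
                c = order[i]
                if (total[c] > 0)
                    printf "cpu%s %s %.1f%%\n", c, deepest[c],
                           100 * deep_ms[c] / total[c]
            }
        }'
}
```

Used as `xenpm get-cpuidle-states | cstate_share`. For cpu 0 in the output above, C1 accounts for roughly 97% of the recorded residency, so the machine spends nearly all of its time in the idle handler.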
Teck Choon Giam
2009-Jul-05 02:56 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
On Sat, Jul 4, 2009 at 3:30 PM, Teck Choon Giam<giamteckchoon@gmail.com> wrote:

Hi Keir,

>> Power management is another difference between 3.3 and 3.4. You can disable
>> 3.4 power management by adding Xen boot parameters: cpuidle=0 cpufreq=none

> I will disable power management and run the test tomorrow to see whether
> the network stall issue is still there.

Using cpuidle=0 cpufreq=none seems to solve the network stall problem.

Thanks.

Kindest regards,
Giam Teck Choon
Ian Pratt
2009-Jul-05 21:36 UTC
RE: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
> >> Power management is another difference between 3.3 and 3.4. You can disable
> >> 3.4 power management by adding Xen boot parameters: cpuidle=0 cpufreq=none
> >
> > I will disable power management and run the test tomorrow to see whether
> > the network stall issue is still there.
>
> Using cpuidle=0 cpufreq=none seems to solve the network stall problem.

Hmm, that's rather disturbing. It's presumably the cpuidle parameter which is having the effect. Quite why deeper sleep states can result in one particular device interrupt getting stuck (as opposed to all of them) is a mystery. It might be interesting to see the boot messages, and also to find out which of the C states is causing the problem (presumably C2 or C3).

In your tests, rather than rebooting the machine you may be able to recover it by unloading and reloading the NIC module (you may need to remove it from the bridge and ifconfig it down first).

Ian
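The reload suggested above could be scripted roughly as follows. This is a sketch under stated assumptions: `BRIDGE` and `PIF` default to `eth0` and `peth0` on the guess that the Xen network-bridge script has renamed the physical NIC to peth0 and attached it to a bridge called eth0, and the `run` wrapper and function names are hypothetical. By default it only prints the commands.

```shell
# Sketch of the NIC module reload. BRIDGE and PIF are assumptions about
# the local bridge layout; override them in the environment if needed.
BRIDGE="${BRIDGE:-eth0}"
PIF="${PIF:-peth0}"
DRY_RUN="${DRY_RUN:-1}"    # 1 = print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

reload_tg3() {
    run brctl delif "$BRIDGE" "$PIF"   # remove the NIC from the bridge
    run ifconfig "$PIF" down           # take the interface down
    run modprobe -r tg3                # unload the driver
    run modprobe tg3                   # load it again
}
```

Re-attaching the interface to the bridge is deliberately left out, because the name the device comes back under after the reload depends on the local udev and network-bridge setup.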
Teck Choon Giam
2009-Jul-06 03:55 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
Hi Ian,

On Mon, Jul 6, 2009 at 5:36 AM, Ian Pratt<Ian.Pratt@eu.citrix.com> wrote:
>> >> Power management is another difference between 3.3 and 3.4. You can disable
>> >> 3.4 power management by adding Xen boot parameters: cpuidle=0 cpufreq=none
>> >
>> > I will disable power management and run the test tomorrow to see whether
>> > the network stall issue is still there.
>>
>> Using cpuidle=0 cpufreq=none seems to solve the network stall problem.
>
> Hmm, that's rather disturbing. It's presumably the cpuidle parameter which is
> having the effect. Quite why deeper sleep states can result in one particular
> device interrupt getting stuck (as opposed to all of them) is a mystery. It
> might be interesting to see the boot messages, and also to find out which of
> the C states is causing the problem (presumably C2 or C3).

If I do not add cpuidle and cpufreq to the xen boot parameters, I get the below:

    # xenpm get-cpuidle-states
    Max C-state: C1

    cpu id : 0
    total C-states : 2
    idle time(ms) : 131588676
    C0 : transition [00000000000019346170] residency [00000000000003897999 ms]
    C1 : transition [00000000000019346170] residency [00000000000131507268 ms]

    cpu id : 1
    total C-states : 2
    idle time(ms) : 131696919
    C0 : transition [00000000000012247741] residency [00000000000003766854 ms]
    C1 : transition [00000000000012247741] residency [00000000000131638414 ms]

    cpu id : 2
    total C-states : 2
    idle time(ms) : 131540647
    C0 : transition [00000000000013405442] residency [00000000000003922680 ms]
    C1 : transition [00000000000013405442] residency [00000000000131482588 ms]

    cpu id : 3
    total C-states : 2
    idle time(ms) : 131527968
    C0 : transition [00000000000031194790] residency [00000000000004030618 ms]
    C1 : transition [00000000000031194790] residency [00000000000131374650 ms]

Sorry, I am unable to give you more details, as currently all servers are booted with cpuidle and cpufreq in the xen boot parameters. I will try to migrate the VMs of one server to another, then use that server to test without cpuidle and cpufreq in the xen boot parameters, and then report back my findings. In fact, all servers are now booted with:

    kernel /xen.gz dom0_mem=256M loglvl=all guest_loglvl=all cpuidle=0 cpufreq=none

If you have any suggestions for what to add to the xen boot parameters, or anything else, feel free to let me know ;)

> In your tests, rather than rebooting the machine you may be able to recover
> it by unloading and reloading the NIC module (you may need to remove it from
> the bridge and ifconfig it down first).

Yes, shutting down all xendomains, network-bridge and xend and then restarting them, without needing to restart the network service, brings back the network most of the time. But it is disruptive, as all VMs need to be shut down cleanly to avoid dirty ext3 file systems.

I noticed that on other servers, without cpuidle=0 cpufreq=none in xen-3.4.x, xenpm get-cpuidle-states shows:

    # xenpm get-cpuidle-states
    Max C-state: C7

Is this due to the processor type (they are not dual-core, quad-core or multi-processor), or to whether it is a VT-d enabled system type?

    # cat /proc/cpuinfo
    processor : 0
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 3000.112
    cache size : 2048 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc pni cid
    bogomips : 6004.86

    processor : 1
    vendor_id : GenuineIntel
    cpu family : 15
    model : 4
    model name : Intel(R) Xeon(TM) CPU 3.00GHz
    stepping : 3
    cpu MHz : 3000.112
    cache size : 2048 KB
    fdiv_bug : no
    hlt_bug : no
    f00f_bug : no
    coma_bug : no
    fpu : yes
    fpu_exception : yes
    cpuid level : 5
    wp : yes
    flags : fpu de tsc msr pae mce cx8 apic mtrr mca cmov pat clflush acpi mmx fxsr sse sse2 ss ht nx constant_tsc pni cid
    bogomips : 6004.86

The above server is not a DELL but a Tyan server:

    # lspci -vvv|grep -i ethernet
    01:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
    Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express
    02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
    Subsystem: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express

The test on this server is OK, with no network stall. However, this server crashes within a month, and when I plug in a monitor and keyboard I can't see any output, nor does ctrl+alt+delete get any response. The only thing I can do is reboot the server, and then the cycle repeats: a sudden crash within a month, sometimes 2 or more times within a month. So this server runs only a backup domU and a mirror domU, which are not so critical. Because of the sudden crashes on this type of server (I have two such servers with the same issue), I can't really run them in real production :(

Thanks.

Kindest regards,
Giam Teck Choon
Keir Fraser
2009-Jul-06 07:19 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
On 06/07/2009 04:55, "Teck Choon Giam" <giamteckchoon@gmail.com> wrote:

>> Hmm, that's rather disturbing. It's presumably the cpuidle parameter which is
>> having the effect. Quite why deeper sleep states can result in one particular
>> device interrupt getting stuck (as opposed to all of them) is a mystery. It
>> might be interesting to see the boot messages, and also to find out which of
>> the C states is causing the problem (presumably C2 or C3).
>
> If I do not add cpuidle and cpufreq to the xen boot parameters, I get the below:
>
> # xenpm get-cpuidle-states
> Max C-state: C1
>
> cpu id : 0
> total C-states : 2

That's interesting, since it appears the troublesome system does not even support deep sleep states (e.g., C3). It has just C0 and C1, which would normally mean C0=running, C1=normal-HLT. I've cc'ed a couple of Intel guys to confirm we couldn't be misreading the xenpm output.

If we're reading this correctly, I think it really means that the special acpi-cx idle handler has a bug in it somewhere. Actually, one bug has been found already, and I will forward the patch to you. It could be worth applying it and rebuilding Xen to see if we're lucky enough for that to solve your problem.

 -- Keir
Teck Choon Giam
2009-Jul-07 01:29 UTC
Re: [Xen-devel] tg3 network stall in xen-3.4.x but not in xen-3.3.x
Hi Keir,

> That's interesting, since it appears the troublesome system does not even
> support deep sleep states (e.g., C3). It has just C0 and C1, which would
> normally mean C0=running, C1=normal-HLT. I've cc'ed a couple of Intel guys
> to confirm we couldn't be misreading the xenpm output.

Oh, OK.

> If we're reading this correctly, I think it really means that the special
> acpi-cx idle handler has a bug in it somewhere. Actually, one bug has been
> found already, and I will forward the patch to you. It could be worth
> applying it and rebuilding Xen to see if we're lucky enough for that to
> solve your problem.

I can test the patch and then report back. Please send me the patch.

Thanks.

Kindest regards,
Giam Teck Choon