Kevin Stange
2017-Jan-27 01:57 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
On 01/26/2017 02:08 PM, Kevin Stange wrote:
> On 01/26/2017 09:35 AM, Johnny Hughes wrote:
>> On 01/26/2017 09:32 AM, Johnny Hughes wrote:
>>> On 01/25/2017 11:49 AM, Kevin Stange wrote:
>>>> On 01/24/2017 11:16 AM, Kevin Stange wrote:
>>>>> On 01/24/2017 09:10 AM, Konrad Rzeszutek Wilk wrote:
>>>>>> On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
>>>>>>> Kevin Stange,
>>>>>>> It can be either kernel or update the NIC driver or firmware of the NIC
>>>>>>> card. Hope that helps!
>>>>>>>
>>>>>>> Xlord
>>>>>>> -----Original Message-----
>>>>>>> From: CentOS-virt [mailto:centos-virt-bounces at centos.org] On Behalf Of Kevin
>>>>>>> Stange
>>>>>>> Sent: Tuesday, January 24, 2017 1:04 AM
>>>>>>> To: centos-virt at centos.org
>>>>>>> Subject: [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 /
>>>>>>> Linux 3.18
>>>>>>>
>>>>> <snip>
>>>>>>>
>>>>>>> Has anyone experienced similar issues with this configuration, and if so,
>>>>>>> does anyone have tips on how to resolve the issues?
>>>>>>
>>>>>> Honestly I would email Intel and see if they can help. This looks like
>>>>>> the NIC decides something is wrong, throws off a PCIe error and
>>>>>> then resets itself.
>>>>>
>>>>> This happens for several different NICs. Is there a good contact at
>>>>> Intel for this kind of thing, or should I just try to reach them through
>>>>> their web site?
>>>>>
>>>>>> It could also be an error in the Linux stack which would "eat" an
>>>>>> interrupt when migrating interrupts (which was fixed
>>>>>> upstream, see below). Are you running irqbalance? Could you try
>>>>>> turning it off?
>>>>>
>>>>> irqbalance is enabled on these servers. I'll try disabling it.
>>>>
>>>> I had stopped irqbalance yesterday afternoon, but had a hypervisor's
>>>> NICs fail anyway in early morning this morning, so I'm pretty sure this
>>>> is not the right tree to bark up.
>>>>
>>>
>>> Here is a set of drivers/firmware from Intel for those NICs:
>>>
>>> https://downloadcenter.intel.com/download/15817/Intel-Network-Adapter-Driver-for-PCI-E-Gigabit-Network-Connections-under-Linux-
>>>
>>> I will see if I can get a CentOS-6 build of the latest version of that
>>> from our older SRPM:
>>>
>>> http://vault.centos.org/6.7/xen4/Source/SPackages/e1000e-2.5.4-3.10.68.2.el6.centos.alt.src.rpm
>>>
>>> I am currently very busy with several c5, c6, c7 updates and the i686
>>> altarch c7 tree .. but I have this on my list. In the meantime, maybe
>>> someone else could also see if those drivers help you (or you could try
>>> to compile / install it).
>>>
>>> Do you have another machine that you can use to see if you can duplicate
>>> the issue NOT running the xen.gz hypervisor boot, but just the straight
>>> kernel?
>
> I can't actually reproduce this problem reliably. It happens randomly
> when the servers are up and running anywhere between a few hours and a
> month or more, and I haven't been able to isolate any specific way to
> cause it to happen. As a result I can't really test different solutions
> on different servers to see what helps. I was hoping other people were
> seeing it so that I could get some direction. If I can reproduce it, it
> won't take me very long to identify what the cause is. Right now if I
> do upgrade the drivers on the systems I won't really know if it's fixed
> until I don't see another issue for several months.
>
>> Actually .. I think this is the driver for you:
>>
>> https://downloadcenter.intel.com/download/13663
>>
>> And this explains how to make it work:
>>
>> http://www.intel.com/content/www/us/en/support/network-and-i-o/ethernet-products/000005767.html
>
> The different combinations of NICs overlap both the e1000e and igb
> drivers, but the most egregious issues have been with the igb ones.
> I'll try to give this a shot and report back if I still see issues with
> a server after doing so, but it might be a week or two before I find out.

So the NICs giving issues in most cases were igb drivers. I've tried
replacing the drivers on some HVs with the version you suggested, but it
doesn't seem to have helped with stability. Any other ideas?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net
Karel Hendrych
2017-Jan-27 12:08 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
Have you tried to eliminate all power management features all over?

Are the devices connected to the same network infrastructure?

There has to be something common.

I've been using Intel NICs with Xen/CentOS for ages with no issues.

Karel

On 27.1.2017 02:57, Kevin Stange wrote:
> <snip>
>
> So the NICs giving issues in most cases were igb drivers. I've tried
> replacing the drivers on some HVs with the version you suggested, but it
> doesn't seem to have helped with stability. Any other ideas?
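
Regarding the driver replacement quoted above: since the failures take days or weeks to show up, it may be worth confirming that each hypervisor is really running the replaced driver before waiting on it. The following is only a rough sketch, not something from the thread. It assumes Python 3.7 or newer and the `ethtool` utility are available, and it relies on `ethtool -i <iface>` printing `key: value` lines such as `driver`, `version` and `bus-info`. The function names are made up for illustration.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse the 'key: value' lines printed by `ethtool -i <iface>` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address in bus-info (which contains colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i` for one interface and return its parsed fields.

    Raises CalledProcessError if ethtool fails, for example when the
    interface does not exist.
    """
    result = subprocess.run(
        ["ethtool", "-i", iface],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ethtool_info(result.stdout)


def driver_summary(iface):
    """Return (driver, version, bus-info) for one interface."""
    info = driver_info(iface)
    return info.get("driver"), info.get("version"), info.get("bus-info")
```

Calling `driver_summary()` for each interface on every hypervisor and comparing the results should show quickly whether any host is still on the old igb or e1000e build. The interface names to pass in depend on the host.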
Kevin Stange
2017-Jan-27 18:21 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
On 01/27/2017 06:08 AM, Karel Hendrych wrote:
> Have you tried to eliminate all power management features all over?

I've been trying to find and disable all power management features but
having relatively little luck with that solving the problems. Stabbing in
the dark, I've tried different ACPI settings, including completely
disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
on the kernel command line. Are there other kernel options that might be
useful to try?

> Are the devices connected to the same network infrastructure?

There are two onboard NICs and two NICs on a dual-port card in each
server. All devices connect to a Cisco switch pair in VSS and the links
are paired in LACP.

> There has to be something common.

The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
and NFS traffic, as well as some basic management stuff over SSH, and
they are configured with an MTU of 9000 on the native VLAN. It's a lot
of features, but I can't really turn them off and then actually have
enough load on the NICs to reproduce the issue. Several of these servers
were installed and being burned in for 3 months without ever having an
issue, but suddenly collapsed when I tried to bring 20 or so real-world
VMs up on them.

The other NICs in the system that are connected don't exhibit issues and
run only VM network interfaces. They are also in LACP and running VLAN
tags, but normal 1500 MTU.

So far it seems to correlate with NICs on the expansion cards, but these
cards also happen to be the ones carrying the storage and management
traffic. I'm trying to swap some of this load to the onboard NICs to see
if the issues migrate over with it, or if they stay with the expansion
cards. If the issue exists on both NIC types, then it rules out the
specific NIC chipset as the culprit. It could point to the driver, but
upgrading it to a newer version did not help and actually appeared to
make everything worse. This issue might actually have more to do with
the PCIe bridge than the NICs, but these are still different
motherboards with different PCIe bridges (5520 vs C600) experiencing the
same issues.

> I've been using Intel NICs with Xen/CentOS for ages with no issues.

I figured that must be so. Everyone uses Intel NICs. If this was a
common issue, it would probably be causing a lot of people a lot of
trouble.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net
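
With kernel options being changed across several hypervisors, one thing that can go wrong is a host that was never rebooted into the new command line. Below is a small sketch, not part of the original message, that checks whether a kernel command line carries an expected setting. The only setting listed is `pcie_aspm=off`, taken from the message above. The function names and the `EXPECTED` table are hypothetical, and reading `/proc/cmdline` assumes a Linux dom0.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Bare flags without '=' (for example 'ro') map to None.
    """
    params = {}
    for token in cmdline.split():
        # Split on the first '=' only, so values containing '=' survive.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


# Hypothetical table of settings to verify; only pcie_aspm=off comes
# from the thread. Extend it with whatever else is being tested.
EXPECTED = {"pcie_aspm": "off"}


def missing_settings(cmdline, expected=EXPECTED):
    """Return the expected settings that the command line does not carry."""
    params = parse_cmdline(cmdline)
    return {name: value for name, value in expected.items()
            if params.get(name) != value}


def running_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read().strip()
```

On a hypervisor, `missing_settings(running_cmdline())` returning an empty dict would suggest the option is active for the running kernel. It says nothing about BIOS-level power management, which would have to be checked separately.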