thr3ads.net - CentOS virt - [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18 [Jan 2017]

Kevin Stange

2017-Jan-27 18:21 UTC

[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

On 01/27/2017 06:08 AM, Karel Hendrych wrote:
> Have you tried to eliminate all power management features all over?
I've been trying to find and disable all power management features, but
I'm having relatively little luck with that solving the problems. Stabbing
in the dark, I've tried different ACPI settings, including completely
disabling ACPI, disabling CPU frequency scaling, and setting pcie_aspm=off
on the kernel command line. Are there other kernel options that might
be useful to try?
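
With this many boot options being toggled, it can help to confirm which ones the running kernel actually received. The following is only a minimal sketch: the function names are made up for illustration, and the optional file argument exists purely so the check can be tried against a sample file instead of /proc/cmdline.

```shell
#!/bin/sh
# has_kernel_param PARAM [CMDLINE_FILE]
# Succeeds if PARAM appears as a whole token on the kernel command line.
# CMDLINE_FILE defaults to /proc/cmdline (hypothetical override for testing).
has_kernel_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each token on its own line, then require an exact full-line match.
    tr -s ' \t' '\n\n' < "$file" | grep -qxF -- "$param"
}

# report_params CMDLINE_FILE PARAM...
# Prints one "present"/"missing" line per parameter.
report_params() {
    file=$1
    shift
    for p in "$@"; do
        if has_kernel_param "$p" "$file"; then
            echo "$p: present"
        else
            echo "$p: missing"
        fi
    done
}
```

For example, `report_params /proc/cmdline pcie_aspm=off pci=nomsi` would show whether both options made it onto the dom0 kernel command line after a reboot.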
> Are the devices connected to the same network infrastructure?
There are two onboard NICs and two NICs on a dual-port card in each
server. All devices connect to a cisco switch pair in VSS and the links
are paired in LACP.
> There has to be something common.
The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
and NFS traffic, as well as some basic management stuff over SSH, and
they are configured with an MTU of 9000 on the native VLAN. It's a lot
of features, but I can't really turn them off and then actually have
enough load on the NICs to reproduce the issue. Several of these
servers were installed and being burned in for 3 months without ever
having an issue, but suddenly collapsed when I tried to bring 20 or so
real-world VMs up on them.

The other NICs in the system that are connected don't exhibit issues and
run only VM network interfaces. They are also in LACP and running VLAN
tags, but normal 1500 MTU.
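
Since the MTU is one of the few differences between the affected and unaffected NICs, a quick listing of the configured MTU per interface makes the split easy to see. This is a rough sketch that reads the standard sysfs `mtu` attribute; the function name and the optional directory argument are illustrative assumptions.

```shell
#!/bin/sh
# list_mtu [SYSFS_NET_DIR]
# Prints "<interface> <mtu>" for every network interface found.
# SYSFS_NET_DIR defaults to /sys/class/net (override is for testing only).
list_mtu() {
    root=${1:-/sys/class/net}
    for f in "$root"/*/mtu; do
        # Skip the unexpanded glob or unreadable entries.
        [ -r "$f" ] || continue
        dev=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

Running `list_mtu` with no arguments on a dom0 should show 9000 on the storage-facing interfaces and 1500 on the VM-only ones, if the configuration matches the description above.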

So far it seems to correlate with NICs on the expansion cards, but that
may be a coincidence, since these cards are also the ones carrying the
storage and management traffic. I'm trying to swap some of this load to
the onboard NICs to see if the issues migrate over with it, or if they
stay with the expansion cards.

If the issue exists on both NIC types, then it rules out the specific
NIC chipset as the culprit. It could point to the driver, but upgrading
it to a newer version did not help and actually appeared to make
everything worse. This issue might actually be more to do with the PCIe
bridge than the NICs, but these are still different motherboards with
different PCIe bridges (5520 vs C600) experiencing the same issues.
> I've been using Intel NICs with Xen/CentOS for ages with no issues.
I figured that must be so. Everyone uses Intel NICs. If this was a
common issue, it would probably be causing a lot of people a lot of trouble.

--
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net

Jinesh Choksi

2017-Jan-30 09:18 UTC

> Are there other kernel options that might be useful to try?
pci=nomsi

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
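
After booting with pci=nomsi, one way to check that a NIC really fell back from MSI/MSI-X is to look at its rows in /proc/interrupts. The sketch below simply counts rows for an interface whose interrupt type mentions "msi". The function name is hypothetical, and the exact type labels differ between kernels and under Xen, so treat the match as an assumption to adjust.

```shell
#!/bin/sh
# msi_irq_count IFNAME [INTERRUPTS_FILE]
# Prints how many interrupt rows mentioning IFNAME also mention MSI.
# INTERRUPTS_FILE defaults to /proc/interrupts (override is for testing).
# Note: this is a plain substring match, so "eth6" also matches "eth60".
msi_irq_count() {
    ifname=$1
    file=${2:-/proc/interrupts}
    # grep -c exits non-zero on a zero count, hence the "|| true".
    grep -F -- "$ifname" "$file" | grep -ci 'msi' || true
}
```

A result of 0 for the affected interface would suggest the option took effect; any other number means MSI or MSI-X vectors are still in use.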




Kevin Stange

2017-Jan-30 18:59 UTC

On 01/30/2017 03:18 AM, Jinesh Choksi wrote:
>> Are there other kernel options that might be useful to try?
> 
> pci=nomsi
> 
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1521173/comments/13
Incidentally, I had already found that one and I'm trying it currently on
one of the boxes.  So far there have been no issues, but it's only been
since Friday.

Also, I found this:

https://xen.crc.id.au/support/guides/install/

There's a 4.4 kernel here built for Xen Dom0, which I'm giving a whirl
to see how stable it is, also only since Friday.  I'm not using anything
else he's packaged from his repo.

On a related note, does the SIG have plans to replace the 3.18 kernel
which is marked as projected EOL of January 2017
(https://www.kernel.org/category/releases.html)?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net

Adi Pircalabu

2017-Jan-30 22:17 UTC

On 28/01/17 05:21, Kevin Stange wrote:
> On 01/27/2017 06:08 AM, Karel Hendrych wrote:
>> Have you tried to eliminate all power management features all over?
> 
> I've been trying to find and disable all power management features but
> having relatively little luck with that solving the problems.  Stabbing
> in the dark, I've tried different ACPI settings, including completely
> disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
> on the kernel command line.  Are there other kernel options that might
> be useful to try?
May I chip in here? In our environment we're randomly seeing:

Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
Jan 17 23:40:14 xen01 kernel:  Tx Queue             <0>
Jan 17 23:40:14 xen01 kernel:  TDH, TDT             <9a>, <127>
Jan 17 23:40:14 xen01 kernel:  next_to_use          <127>
Jan 17 23:40:14 xen01 kernel:  next_to_clean        <98>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx_buffer_info[next_to_clean]
Jan 17 23:40:14 xen01 kernel:  time_stamp           <218443db3>
Jan 17 23:40:14 xen01 kernel:  jiffies              <218445368>
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx hang 1 detected on queue 0, resetting adapter
Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Reset adapter
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1 eth6: PCIe transaction pending bit also did not clear.
Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1: master disable timed out
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for interface eth6, disabling it in 200 ms.
Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely down for interface eth6, disabling it
[...] repeated every second or so.
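
To see how often and on which ports this happens, the hang messages can be tallied per interface from the kernel log. This is a small sketch, assuming the message layout shown above, where the interface name is the token immediately before "Detected". The function name is made up for illustration.

```shell
#!/bin/sh
# count_tx_hangs LOGFILE...
# Prints "<interface> <count>" for each interface that logged
# "Detected Tx Unit Hang", sorted by interface name.
count_tx_hangs() {
    awk '/Detected Tx Unit Hang/ {
        for (i = 2; i <= NF; i++)
            if ($i == "Detected") {
                dev = $(i - 1)          # e.g. "eth6:"
                sub(/:$/, "", dev)      # drop the trailing colon
                hangs[dev]++
                break
            }
    }
    END { for (dev in hangs) print dev, hangs[dev] }' "$@" | sort
}
```

Something like `count_tx_hangs /var/log/messages` would then give a per-port count that can be compared across hosts or before and after a configuration change.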
>> Are the devices connected to the same network infrastructure?
> 
> There are two onboard NICs and two NICs on a dual-port card in each
> server.  All devices connect to a cisco switch pair in VSS and the links
> are paired in LACP.
We've been experiencing ixgbe stability issues on CentOS 6.x with various
3.x kernels for years with different ixgbe driver versions and, to date,
the only way to completely get rid of the issue was to switch from Intel
to Broadcom. Just like in your case, the problem pops up randomly and
the only reliable temporary fix is to reboot the affected Xen node.
Another temporary fix that worked several times, but not always, was to
migrate / shut down the domUs, deactivate the volume groups, log out of
all the iSCSI targets, then run "ifdown bond1" and "modprobe -r ixgbe"
followed by "ifup bond1".
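
The recovery sequence just described could be scripted roughly as follows. This is a hedged sketch, not a tested procedure: only the ifdown, modprobe -r and ifup commands are quoted above, while `vgchange -an` and `iscsiadm -m node --logoutall=all` are assumed stand-ins for "deactivate the volume groups" and "log out of all the iSCSI targets". The helper names are hypothetical, and the script only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# run CMD...: print the command, or execute it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# reset_storage_nic [BOND] [MODULE]
# Assumes the domUs were already migrated away or shut down by hand.
reset_storage_nic() {
    bond=${1:-bond1}
    module=${2:-ixgbe}
    # Assumption: deactivates every VG it can; narrow to the iSCSI-backed VGs.
    run vgchange -an || return 1
    # Assumption: open-iscsi is in use; logs out of all recorded targets.
    run iscsiadm -m node --logoutall=all || return 1
    run ifdown "$bond" || return 1
    run modprobe -r "$module" || return 1
    run ifup "$bond" || return 1
}
```

Calling `reset_storage_nic bond1 ixgbe` prints the five steps for review. Running it with `DRY_RUN=0` would execute them and stop at the first failure.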

The setup is:
- Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
- Tried Xen kernels from both xen.crc.id.au and CentOS 6 Xen repos
- LACP bonding to connect to the NFS & iSCSI storage using Brocade
  VDX6740T fabric. MTU=9000
>> There has to be something common.
> 
> The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
> and NFS traffic, as well as some basic management stuff over SSH, and
> they are configured with an MTU of 9000 on the native VLAN.  It's a lot
> of features, but I can't really turn them off and then actually have
> enough load on the NICs to reproduce the issue.  Several of these
> servers were installed and being burned in for 3 months without ever
> having an issue, but suddenly collapsed when I tried to bring 20 or so
> real-world VMs up on them.
There "appears" to be some sort of load-dependent pattern here too, but
it's impossible to confirm it.
The only stability improvement I was able to get came from using
"dom0_max_vcpus=1 dom0_vcpus_pin". Haven't tried pci=nomsi yet.
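
These two are hypervisor options rather than dom0 kernel options, so they show up in the `xen_commandline` field of `xl info` instead of /proc/cmdline. A minimal sketch for checking them, with hypothetical function names, reading `xl info` output on stdin:

```shell
#!/bin/sh
# xen_cmdline: reads `xl info` output on stdin and prints the value of
# the xen_commandline field (the hypervisor boot options).
xen_cmdline() {
    sed -n 's/^xen_commandline[[:space:]]*:[[:space:]]*//p'
}

# has_xen_opt OPT: reads `xl info` output on stdin and succeeds if OPT
# appears as a whole token on the hypervisor command line.
has_xen_opt() {
    xen_cmdline | tr -s ' ' '\n' | grep -qxF -- "$1"
}
```

For instance, `xl info | has_xen_opt dom0_vcpus_pin && echo "dom0 vCPUs pinned"` would confirm the option survived a bootloader update.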
> The other NICs in the system that are connected don't exhibit issues and
> run only VM network interfaces.  They are also in LACP and running VLAN
> tags, but normal 1500 MTU.
> tags, but normal 1500 MTU.
> 
> So far it seems to correlate with NICs on the expansion cards, but it's
> a coincidence that these cards are the ones with the storage and
> management traffic.  I'm trying to swap some of this load to the onboard
> NICs to see if the issues migrate over with it, or if they stay with the
> expansion cards.
> 
> If the issue exists on both NIC types, then it rules out the specific
> NIC chipset as the culprit.  It could point to the driver, but upgrading
> it to a newer version did not help and actually appeared to make
> everything worse.  This issue might actually be more to do with the PCIe
> bridge than the NICs, but these are still different motherboards with
> different PCIe bridges (5520 vs C600) experiencing the same issues.
> 
>> I've been using Intel NICs with Xen/CentOS for ages with no issues.
> 
> I figured that must be so.  Everyone uses Intel NICs.  If this was a
> common issue, it would probably be causing a lot of people a lot of
> trouble.
> 
Adi Pircalabu

Kevin Stange

2017-Jan-30 23:49 UTC

On 01/30/2017 04:17 PM, Adi Pircalabu wrote:
> On 28/01/17 05:21, Kevin Stange wrote:
>> On 01/27/2017 06:08 AM, Karel Hendrych wrote:
>>> Have you tried to eliminate all power management features all over?
>>
>> I've been trying to find and disable all power management features but
>> having relatively little luck with that solving the problems.  Stabbing
>> in the dark, I've tried different ACPI settings, including completely
>> disabling it, disabling CPU frequency scaling, and setting pcie_aspm=off
>> on the kernel command line.  Are there other kernel options that might
>> be useful to try?
> 
> May I chip in here? In our environment we're randomly seeing:
Welcome.  It's a relief to know someone else has been having a similar
nightmare!  Perhaps that's not encouraging...
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
> Jan 17 23:40:14 xen01 kernel:  Tx Queue             <0>
> Jan 17 23:40:14 xen01 kernel:  TDH, TDT             <9a>, <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_use          <127>
> Jan 17 23:40:14 xen01 kernel:  next_to_clean        <98>
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx_buffer_info[next_to_clean]
> Jan 17 23:40:14 xen01 kernel:  time_stamp           <218443db3>
> Jan 17 23:40:14 xen01 kernel:  jiffies              <218445368>
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: tx hang 1 detected on queue 0, resetting adapter
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Reset adapter
> Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1 eth6: PCIe transaction pending bit also did not clear.
> Jan 17 23:40:15 xen01 kernel: ixgbe 0000:04:00.1: master disable timed out
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status down for interface eth6, disabling it in 200 ms.
> Jan 17 23:40:15 xen01 kernel: bonding: bond1: link status definitely down for interface eth6, disabling it
> [...] repeated every second or so.
> 
>>> Are the devices connected to the same network infrastructure?
>>
>> There are two onboard NICs and two NICs on a dual-port card in each
>> server.  All devices connect to a cisco switch pair in VSS and the links
>> are paired in LACP.
> 
> We've been experiencing ixgbe stability issues on CentOS 6.x with various
> 3.x kernels for years with different ixgbe driver versions and, to date,
> the only way to completely get rid of the issue was to switch from Intel
> to Broadcom. Just like in your case, the problem pops up randomly and
> the only reliable temporary fix is to reboot the affected Xen node.
> Another temporary fix that worked several times but not always was to
> migrate / shutdown the domUs, deactivate the volume groups, log out of
> all the iSCSI targets, "ifdown bond1" and "modprobe -r ixgbe" followed
> by "ifup bond1".
> 
> The set up is:
> - Intel Dual 10Gb Ethernet - either X520-T2 or X540-T2
> - Tried Xen kernels from both xen.crc.id.au and CentOS 6 Xen repos
> - LACP bonding to connect to the NFS & iSCSI storage using Brocade
> VDX6740T fabric. MTU=9000
You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
4.4 kernel.  Any chance you have tested with that one?

Did you ever try without MTU=9000 (default 1500 instead)?

I am having certain issues on certain hardware where there's no shutting
down the affected NICs.  Trying to do so or unload the igb module hangs
the entire box.  But in that case they're throwing AER errors instead of
just unit hangs:

pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
igb 0000:04:00.1:   device [8086:10a7] error status/mask=00004000/00000000
igb 0000:04:00.1:    [14] Completion Timeout     (First)
igb 0000:04:00.1: broadcast error_detected message
igb 0000:04:00.1: broadcast slot_reset message
igb 0000:04:00.1: broadcast resume message
igb 0000:04:00.1: AER: Device recovery successful

Spammed continuously.
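
The "error status" value in those messages is a bitmask, and 00004000 has only bit 14 set, which is why the kernel prints "[14] Completion Timeout". A small sketch for decoding such a value by hand follows. The function name is made up, and only a handful of bit names are filled in here, so anything else is reported as unnamed rather than guessed.

```shell
#!/bin/sh
# aer_uncorrectable_bits HEXSTATUS
# Prints one "[bit] name" line for every bit set in an AER uncorrectable
# error status value such as 00004000 (hex, without a 0x prefix).
aer_uncorrectable_bits() {
    status=$((0x$1))
    bit=0
    while [ "$bit" -lt 32 ]; do
        if [ $(( (status >> bit) & 1 )) -eq 1 ]; then
            case $bit in
                12) name="Poisoned TLP" ;;
                14) name="Completion Timeout" ;;
                15) name="Completer Abort" ;;
                16) name="Unexpected Completion" ;;
                18) name="Malformed TLP" ;;
                20) name="Unsupported Request" ;;
                *)  name="(not named in this sketch)" ;;
            esac
            echo "[$bit] $name"
        fi
        bit=$((bit + 1))
    done
}
```

`aer_uncorrectable_bits 00004000` prints `[14] Completion Timeout`, matching the log above. A value with several bits set would list each one on its own line.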

Switching to Broadcom would be a possibility, though it's tricky because
two of the NICs are onboard, so we'd need to replace the dual-port 1G
card with a quad-port 1G card.  Since you're saying you're all 10G,
maybe you don't know, but if you have any specific Broadcom 1G cards
you've had good fortune with, I'd be interested in knowing which models.
Broadcom cards are rarely labeled as such, which makes finding them a
bit more difficult than Intel ones.
>>> There has to be something common.
>>
>> The NICs having issues are running a native VLAN, a tagged VLAN, iSCSI
>> and NFS traffic, as well as some basic management stuff over SSH, and
>> they are configured with an MTU of 9000 on the native VLAN.  It's a lot
>> of features, but I can't really turn them off and then actually have
>> enough load on the NICs to reproduce the issue.  Several of these
>> servers were installed and being burned in for 3 months without ever
>> having an issue, but suddenly collapsed when I tried to bring 20 or so
>> real-world VMs up on them.
> 
> There "appears" to be some sort of load-dependent pattern here too, but
> it's impossible to confirm it.
> The only stability improvement I was able to get came from using
> "dom0_max_vcpus=1 dom0_vcpus_pin". Haven't tried pci=nomsi yet.
So far the one hypervisor with pci=nomsi has been quiet but that doesn't
mean it's fixed.  I need to give it 6 weeks or so. :)

Thanks for your input on the issue!

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net

Jinesh Choksi

2017-Jan-31 10:00 UTC

On 30 January 2017 at 22:17, Adi Pircalabu <adi at ddns.com.au> wrote:
> May I chip in here? In our environment we're randomly seeing:
>
> Jan 17 23:40:14 xen01 kernel: ixgbe 0000:04:00.1 eth6: Detected Tx Unit Hang
>
Someone in this thread: https://sourceforge.net/p/e1000/bugs/530/#2855
reported that "With these kernels I was only able to work around the
issue by disabling tx-checksumming offload with ethtool."

However, that was reported for kernels 4.2.6 / 4.2.8 / 4.4.8 and 4.4.10. I
just thought it could be something you could rule out and hence mentioned
it:

ethtool --offload eth6 rx off tx off


Another thing to rule out in case it's a regression with Intel NICs and TSO:

# tso => tcp-segmentation-offload
# gso => generic-segmentation-offload
# gro => generic-receive-offload
# sg => scatter-gather
# ufo => udp-fragmentation-offload (Cannot change)
# lro => large-receive-offload (Cannot change)

ethtool -K eth6 tso off gso off gro off sg off
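
Before and after changing these settings, it is worth recording which offloads are actually enabled, since some features are fixed and silently stay as they were. A minimal sketch, with an illustrative function name, that filters `ethtool -k` output (lowercase -k shows the current settings) down to the features reported as "on":

```shell
#!/bin/sh
# offloads_on: reads `ethtool -k IFACE` output on stdin and prints the
# name of every feature whose value starts with "on".
offloads_on() {
    awk -F': ' 'NF >= 2 && $2 ~ /^on/ {
        name = $1
        sub(/^[ \t]+/, "", name)   # sub-features are indented
        print name
    }'
}
```

Usage would look like `ethtool -k eth6 | offloads_on`, run once before and once after the `ethtool -K` commands above to confirm what changed.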