thr3ads.net - CentOS virt - [CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18 [Feb 2017]


Kevin Stange

2017-Jan-31 00:41 UTC

[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18

On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
> On 31/01/17 10:49, Kevin Stange wrote:
>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>> 4.4 kernel.  Any chance you have tested with that one?
>
> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
> Xen with kernel 4.4.

I'll keep you (and others here) posted on my own experiences with that
4.4 build over the next few weeks to report on any issues.  I'm hoping
something happened between 3.18 and 4.4 that fixed underlying problems.
>> Did you ever try without MTU=9000 (default 1500 instead)?
> 
> Yes, also with all sorts of configuration combinations like LACP rate
> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
Alright, I'll assume that probably won't help then.  I tried it on one
box which hasn't had the issue again yet, but that doesn't guarantee
anything.
>> I am having certain issues on certain hardware where there's no shutting
>> down the affected NICs.  Trying to do so or unload the igb module hangs
>> the entire box.  But in that case they're throwing AER errors instead of
>> just unit hangs:
>>
>> pcieport 0000:00:03.0: AER: Uncorrected (Non-Fatal) error received: id=0000
>> igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
>> igb 0000:04:00.1:   device [8086:10a7] error status/mask=00004000/00000000
>> igb 0000:04:00.1:    [14] Completion Timeout     (First)
>> igb 0000:04:00.1: broadcast error_detected message
>> igb 0000:04:00.1: broadcast slot_reset message
>> igb 0000:04:00.1: broadcast resume message
>> igb 0000:04:00.1: AER: Device recovery successful
> 
> This is interesting. We've never had any problems with the 1Gb NICs, but
> we're only using 10Gb for the storage network. Could it be a common
> problem with either the adapters, or the drivers which only replicate
> running the Xen enabled kernel?

Since I've never run the 3.18 kernel on a box of this type without
running in a dom0 and since I can't reproduce this kind of issue without
a fair amount of NIC load over a tremendous period of time, it's
impossible to test if it's tied to Xen.

However, I know this hardware works well under 2.6.32-*.el6 and
3.10.0-*.el7 kernels without stability problems, as it did with
2.6.18-*.el5xen (Xen 3.4.4).

I suspect the above errors are actually due to something PCIe related,
and I have a subset of boxes which are actually being impacted by two
distinct problems with equivalent impact, which increases the likelihood
that the boxes will die.  Another set of boxes only ever sees the unit
hangs which seem unrecoverable even unloading/reloading the driver.  A
third set has random recoverable unit hangs only.  With so much
diversity, it's even harder to pin any specific causes to the problems.
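
For anyone trying to sort hosts into groups like these, counting AER events per device from the kernel log can help. The following is only a sketch in Python, assuming each message sits on a single line (as in unwrapped dmesg output) and looks like the AER lines quoted earlier in this message. The function name `parse_aer_events` and the regular expression are illustrative and not part of any tool mentioned in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
#   igb 0000:04:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, id=0401(Requester ID)
# The pattern is an assumption derived from the log excerpt in this thread.
AER_LINE = re.compile(
    r"(?P<driver>\S+) (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"PCIe Bus Error: severity=(?P<severity>[^,]+), type=(?P<layer>[^,]+)"
)

def parse_aer_events(lines):
    """Count AER bus errors keyed by (driver, PCI address, severity, layer)."""
    events = Counter()
    for line in lines:
        match = AER_LINE.search(line)
        if match:
            key = (match["driver"], match["bdf"],
                   match["severity"].strip(), match["layer"].strip())
            events[key] += 1
    return events
```

Feeding it the lines of a saved dmesg or /var/log/messages file returns one counter entry per device and error type, which makes it easier to see which boxes show the PCIe errors at all and which only show unit hangs.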

The fact we're both pushing NFS and iSCSI traffic over these links makes
me wonder if there's something about that kind of traffic that increases
the chances of causing these issues.  When I put VM network traffic over
the same NICs, they seem a lot less prone to failures, but also end up
pushing less traffic in general.
>> Switching to Broadcom would be a possibility, though it's tricky because
>> two of the NICs are onboard, so we'd need to replace the dual-port 1G
>> card with a quad-port 1G card.  Since you're saying you're all 10G,
>> maybe you don't know, but if you have any specific Broadcom 1G cards
>> you've had good fortune with, I'd be interested in knowing which models.
>> Broadcom cards are rarely labeled as such which makes finding them a
>> bit more difficult than Intel ones.
> 
> We've purchased a number of servers with Broadcom BCM957810A1008G, sold
> by Dell as QLogic 57810 dual 10Gb Base-T adapters, none of them going up
> & down like a yo-yo so far.
> 
>> So far the one hypervisor with pci=nomsi has been quiet but that doesn't
>> mean it's fixed.  I need to give it 6 weeks or so. :)
> 
> It'd be more like 6-9 months for us, making it terrible to debug it :-/
I had a bunch of these on relatively light VM load for 3 months for
"burn in" with no issues but they've been pretty aggressively failing
since I started to try to put real loads on them.  Still, it's odd
because some of the boxes with identical hardware and similar VM loads
have not yet blown up after 3 or more weeks, and maybe they won't for
several months.

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net

Kevin Stange

2017-Feb-10 19:29 UTC

On 01/30/2017 06:41 PM, Kevin Stange wrote:
> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>> On 31/01/17 10:49, Kevin Stange wrote:
>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>>> 4.4 kernel.  Any chance you have tested with that one?
>>
>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
>> Xen with kernel 4.4.
>
> I'll keep you (and others here) posted on my own experiences with that
> 4.4 build over the next few weeks to report on any issues.  I'm hoping
> something happened between 3.18 and 4.4 that fixed underlying problems.
>
>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>
>> Yes, also with all sorts of configuration combinations like LACP rate
>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>
> Alright, I'll assume that probably won't help then.  I tried it on one
> box which hasn't had the issue again yet, but that doesn't guarantee
> anything.

I was able to discover something new, which might not conclusively prove
anything, but it at least seems to rule out the pci=nomsi kernel option
from being effective.

I had one server booted with that option as well as MTU 1500.  It was
stable for quite a long time, so I decided to try turning the MTU back
to 9000 and within 12 hours, the interface on the expansion NIC with the
jumbo MTU failed.

The other NIC in the LACP bundle is onboard and didn't fail.  The other
NIC on the dual-port expansion card also didn't fail.  This leads me to
believe that ONE of the bugs I'm experiencing is related to 82575EB +
jumbo frames.
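
To find other hosts exposed to the same combination, one could scan sysfs for matching NICs that are running a jumbo MTU. This is a rough Python sketch, not a tested tool. It assumes the PCI ID `8086:10a7` from the AER log earlier in the thread identifies the 82575EB ports, and the function name `jumbo_suspects` is made up for illustration.

```python
from pathlib import Path

# PCI vendor/device IDs taken from the AER log earlier in the thread
# ("device [8086:10a7]"); treating that ID as the 82575EB is an assumption.
SUSPECT_IDS = {("0x8086", "0x10a7")}
STANDARD_MTU = 1500

def jumbo_suspects(sysfs_net=Path("/sys/class/net")):
    """List (interface, mtu) pairs for suspect NICs whose MTU exceeds 1500."""
    found = []
    for iface in sorted(sysfs_net.iterdir()):
        try:
            vendor = (iface / "device" / "vendor").read_text().strip()
            device = (iface / "device" / "device").read_text().strip()
            mtu = int((iface / "mtu").read_text())
        except (OSError, ValueError):
            # Virtual interfaces (lo, bonds, bridges) have no PCI device entry.
            continue
        if (vendor, device) in SUSPECT_IDS and mtu > STANDARD_MTU:
            found.append((iface.name, mtu))
    return found
```

Calling `jumbo_suspects()` on a dom0 returns the interfaces worth watching; an empty list means no matching port is currently configured above the default MTU.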

I still think I'm also having a PCI-e issue that is separate and
additional on top of that, and which has not reared its head recently,
making it difficult for me to gather any new data.

One of the things I've done that seemed to help a lot with stability was
balance the LACP so that one NIC from onboard and one NIC from expansion
card is in each LAG.  Previously we just had the first LAG onboard and
the second on the expansion card.  This way, at least, given the
expansion NIC's propensity toward failing first, I don't have to crash
the server and all running VMs to recover.
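
The balancing described above can be written down as a small pairing rule. This Python sketch only illustrates the layout; the interface names (`eth0` to `eth3`) and bond names are hypothetical, and it does not generate any real bonding configuration.

```python
def balanced_lags(onboard, expansion):
    """Pair ports so every LAG has one onboard and one expansion-card member."""
    if len(onboard) != len(expansion):
        raise ValueError("need the same number of onboard and expansion ports")
    return {
        f"bond{i}": [on_port, exp_port]
        for i, (on_port, exp_port) in enumerate(zip(onboard, expansion))
    }

# Hypothetical port names: eth0/eth1 onboard, eth2/eth3 on the expansion card.
layout = balanced_lags(["eth0", "eth1"], ["eth2", "eth3"])
```

With this layout, losing either expansion port still leaves each LAG with a working onboard member, whereas the earlier layout (both onboard ports in one LAG, both expansion ports in the other) could lose an entire LAG to a single card.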

I've seen absolutely no issues yet with the 4.4 kernel either, but I am
not willing to call that a win because of the quiet from even the
servers on which no tweaks have been applied yet.

I will continue the story as I have more material! :)

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net

-=X.L.O.R.D=-

2017-Feb-12 04:08 UTC

Kevin Stange,
Good attempt. Jumbo frames are extremely important when hosting IaaS, not to
mention for other providers who need the feature for network-specific
applications.

Xlord

Adi Pircalabu

2017-Feb-12 23:07 UTC

On 11/02/17 06:29, Kevin Stange wrote:
> On 01/30/2017 06:41 PM, Kevin Stange wrote:
>> On 01/30/2017 06:12 PM, Adi Pircalabu wrote:
>>> On 31/01/17 10:49, Kevin Stange wrote:
>>>> You said 3.x kernels specifically. The kernel on Xen Made Easy now is a
>>>> 4.4 kernel.  Any chance you have tested with that one?
>>>
>>> Not yet, however the future Xen nodes we'll deploy will run CentOS 7 and
>>> Xen with kernel 4.4.
>>
>> I'll keep you (and others here) posted on my own experiences with that
>> 4.4 build over the next few weeks to report on any issues.  I'm hoping
>> something happened between 3.18 and 4.4 that fixed underlying problems.
>>
>>>> Did you ever try without MTU=9000 (default 1500 instead)?
>>>
>>> Yes, also with all sorts of configuration combinations like LACP rate
>>> slow/fast, "options ixgbe LRO=0,0" and so on. No improvement.
>>
>> Alright, I'll assume that probably won't help then.  I tried it on one
>> box which hasn't had the issue again yet, but that doesn't guarantee
>> anything.
> 
> I was able to discover something new, which might not conclusively prove
> anything, but it at least seems to rule out the pci=nomsi kernel option
> from being effective.
> 
> I had one server booted with that option as well as MTU 1500.  It was
> stable for quite a long time, so I decided to try turning the MTU back
> to 9000 and within 12 hours, the interface on the expansion NIC with the
> jumbo MTU failed.
> 
> The other NIC in the LACP bundle is onboard and didn't fail.  The other
> NIC on the dual-port expansion card also didn't fail.  This leads me to
> believe that ONE of the bugs I'm experiencing is related to 82575EB +
> jumbo frames.
> 
> I still think I'm also having a PCI-e issue that is separate and
> additional on top of that, and which has not reared its head recently,
> making it difficult for me to gather any new data.
> 
> One of the things I've done that seemed to help a lot with stability was
> balance the LACP so that one NIC from onboard and one NIC from expansion
> card is in each LAG.  Previously we just had the first LAG onboard and
> the second on the expansion card.  This way, at least, given the
> expansion NIC's propensity toward failing first, I don't have to crash
> the server and all running VMs to recover.
> 
> I've seen absolutely no issues yet with the 4.4 kernel either, but I am
> not willing to call that a win because of the quiet from even the
> servers on which no tweaks have been applied yet.
Thanks for the heads-up Kevin, appreciated. One thing I need to clarify, 
though: what kernel was this machine running at the time?

Adi Pircalabu
