Kevin Stange
2017-Jan-23 17:04 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
I have three different types of CentOS 6 Xen 4.4 based hypervisors (by hardware) that are experiencing stability issues which I haven't been able to track down. All three types seem to be having issues with NIC and/or PCIe. In most cases, the issues are unrecoverable and require a hard boot to resolve. All have Intel NICs.

Often the systems will remain stable for days or weeks, then suddenly encounter one of these issues. I have yet to tie the error to any specific action on the systems and can't reproduce it reliably.

- Supermicro X8DT3, Dual Xeon E5620, 2x 82575EB NICs, 2x 82576 NICs

Kernel messages upon failure:

pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Transaction Layer, id=0018(Receiver ID)
pcieport 0000:00:03.0: device [8086:340a] error status/mask=00002000/00001001
pcieport 0000:00:03.0: [13] Advisory Non-Fatal
pcieport 0000:00:03.0: Error of this Agent(0018) is reported first
igb 0000:04:00.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0400(Receiver ID)
igb 0000:04:00.0: device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.0: [ 0] Receiver Error (First)
igb 0000:04:00.1: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=0401(Receiver ID)
igb 0000:04:00.1: device [8086:10a7] error status/mask=00002001/00002000
igb 0000:04:00.1: [ 0] Receiver Error (First)

This spams to the console continuously until hard booting.

- Supermicro X9DRD-iF/LF, Dual Xeon E5-2630, 2x I350, 2x 82575EB

igb 0000:82:00.0: Detected Tx Unit Hang
  Tx Queue <1>
  TDH <43>
  TDT <50>
  next_to_use <50>
  next_to_clean <43>
buffer_info[next_to_clean]
  time_stamp <12e6bc0b6>
  next_to_watch <ffff880006aa7440>
  jiffies <12e6bc8dc>
  desc.status <1c8210>

This spams to the console continuously until hard booting.

- Supermicro X9DRT, Dual Xeon E5-2650, 2x I350, 2x 82571EB

e1000e 0000:04:00.0 eth2: Detected Hardware Unit Hang:
  TDH <ff>
  TDT <33>
  next_to_use <33>
  next_to_clean <fd>
buffer_info[next_to_clean]:
  time_stamp <138230862>
  next_to_watch <ff>
  jiffies <138231ac0>
  next_to_watch.status <0>
MAC Status <80383>
PHY Status <792d>
PHY 1000BASE-T Status <3c00>
PHY Extended Status <3000>
PCI Status <10>

On this type of system, the NIC automatically recovers and I don't need to reboot.

So far I tried using pcie_aspm=off to see if that would help, but it appears that the 3.18 kernel turns off ASPM by default on these due to probing the BIOS. Stability issues were not resolved by the changes.

On the latter system type I also turned off all offloading settings. It appears the stability increased slightly but it didn't fully resolve the problem. I haven't adjusted offload settings on the first two server types yet.

I suspect this problem is related to the 3.18 kernel used by the virt SIG, as we had these running Xen on CentOS 5's kernel with no issues for years, and systems of these types used elsewhere in our facility are stable under CentOS 6's standard kernel. This affects more than one server of each type, so I don't believe it is a hardware failure, or else it's a hardware design flaw.

Has anyone experienced similar issues with this configuration, and if so, does anyone have tips on how to resolve the issues?

-- 
Kevin Stange
Chief Technology Officer
Steadfast | Managed Infrastructure, Datacenter and Cloud Services
800 S Wells, Suite 190 | Chicago, IL 60607
312.602.2689 X203 | Fax: 312.602.2688
kevin at steadfast.net | www.steadfast.net
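For readers trying to interpret the "error status/mask" values in the AER messages above, here is a minimal Python sketch that splits such a line into the bits that are set, masked, and left unmasked. The bit names are my reading of the PCIe AER Correctable Error Status register layout; only bits 0 and 13 are confirmed by the log itself, and the function name decode_aer is made up for illustration.

```python
import re

# Assumed bit names for the AER Correctable Error Status register.
# Only bit 0 and bit 13 are confirmed by the kernel messages in this thread.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def bit_names(value):
    """Return the names of all bits set in value, lowest bit first."""
    return [CORRECTABLE_BITS.get(bit, "bit %d" % bit)
            for bit in range(32) if (value >> bit) & 1]


def decode_aer(line):
    """Decode one 'error status/mask=...' kernel line (hypothetical helper)."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {
        "set": bit_names(status),              # every bit raised by the device
        "masked": bit_names(status & mask),    # raised but masked off
        "unmasked": bit_names(status & ~mask), # raised and not masked
    }
```

Applied to the igb line (status 00002001, mask 00002000), the unmasked set is just "Receiver Error", which is consistent with the log printing only "[ 0] Receiver Error" for that device.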
-=X.L.O.R.D=-
2017-Jan-24 13:29 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
Kevin Stange,

It could be the kernel; otherwise, try updating the NIC driver or the firmware of the NIC card. Hope that helps!

Xlord
Konrad Rzeszutek Wilk
2017-Jan-24 15:10 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
On Tue, Jan 24, 2017 at 09:29:39PM +0800, -=X.L.O.R.D=- wrote:
> Has anyone experienced similar issues with this configuration, and if so,
> does anyone have tips on how to resolve the issues?

Honestly I would email Intel and see if they can help. This looks like the NIC decides something is wrong, throws a PCIe error and then resets itself.

It could also be an error in the Linux stack which would "eat" an interrupt when migrating interrupts (which was fixed upstream, see below). Are you running irqbalance? Could you try turning it off?

Did you have these issues with an earlier kernel?

The fix was:

commit ff1e22e7a638a0782f54f81a6c9cb139aca2da35
Author: Boris Ostrovsky <boris.ostrovsky at oracle.com>
Date: Fri Mar 18 10:11:07 2016 -0400

    xen/events: Mask a moving irq

and then there was a fix to this fix:

commit f0f393877c71ad227d36705d61d1e4062bc29cf5
Author: Ross Lagerwall <ross.lagerwall at citrix.com>
Date: Tue May 10 16:11:00 2016 +0100

    xen/events: Don't move disabled irqs
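One way to see whether interrupts for a NIC are actually being moved between CPUs (for example by irqbalance) is to compare two snapshots of /proc/interrupts. The following is a rough Python sketch of that idea, not a tested diagnostic tool: the function names are hypothetical, and the sample rows in the usage note are invented to show the format, not taken from the affected systems.

```python
def parse_interrupts(text):
    """Parse /proc/interrupts text into {irq: (label, [count per CPU])}."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    counts = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        nums = fields[:ncpu]
        if len(nums) < ncpu or not all(f.isdigit() for f in nums):
            continue  # rows such as ERR/MIS do not have one column per CPU
        label = " ".join(fields[ncpu:])
        counts[irq.strip()] = (label, [int(f) for f in nums])
    return counts


def active_cpus(before, after, name):
    """For IRQs whose label contains name, list CPUs whose count increased."""
    result = {}
    for irq, (label, new) in after.items():
        if name not in label or irq not in before:
            continue
        old = before[irq][1]
        result[irq] = [cpu for cpu, (a, b) in enumerate(zip(old, new)) if b > a]
    return result
```

Usage would be to read /proc/interrupts twice, a few seconds apart, pass both through parse_interrupts, and call active_cpus(first, second, "eth0"). If the CPU list for a given IRQ changes from one interval to the next, that IRQ is being migrated; with irqbalance stopped it should stay put.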
Johnny Hughes
2017-Feb-21 17:47 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
On 01/23/2017 11:04 AM, Kevin Stange wrote:
> Has anyone experienced similar issues with this configuration, and if
> so, does anyone have tips on how to resolve the issues?

Kevin,

Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along with the newer linux-firmware packages and xfsprogs).

If you enable the xen-testing repository in your CentOS-Xen.repo file (assuming it is pointing to xen-44 and not xen-46) then a 'yum upgrade' should replace all the needed packages.

The actual path is here for the packages:

https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/

Hopefully this helps.

Thanks,
Johnny Hughes
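For anyone who wants to script the "enable the xen-testing repository" step, here is a hedged Python sketch that flips enabled to 1 for one section of a yum .repo file. The section name centos-virt-xen-testing and the sample file contents are assumptions for illustration; check the actual section names in your own CentOS-Xen.repo. Note that rewriting the file through configparser drops any comments it contained.

```python
import configparser
import io

# Hypothetical repo file contents; only the baseurl path comes from this thread.
SAMPLE_REPO = """\
[centos-virt-xen]
name=CentOS-6 - xen
enabled=1

[centos-virt-xen-testing]
name=CentOS-6 - xen - testing
baseurl=https://buildlogs.centos.org/centos/6/virt/x86_64/xen-44/
gpgcheck=0
enabled=0
"""


def enable_repo(repo_text, section):
    """Return repo_text with enabled=1 set in the given section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(repo_text)
    if not parser.has_section(section):
        raise KeyError("no such repo section: %s" % section)
    parser.set(section, "enabled", "1")
    out = io.StringIO()
    # yum repo files conventionally use key=value without spaces
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

In practice you would read the real file, pass its text and the real section name to enable_repo, write the result back, and then run 'yum upgrade' as described above.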
Johnny Hughes
2017-Feb-21 17:50 UTC
[CentOS-virt] NIC Stability Problems Under Xen 4.4 / CentOS 6 / Linux 3.18
On 02/21/2017 11:47 AM, Johnny Hughes wrote:
> Please try the 4.9.11-22 kernel that I just released for CentOS-6 (along
> with the newer linux-firmware packages and xfsprogs).

I should have said .. 'just released for testing' :)

I have been using this for 4 or 5 days with no issues in production, but it needs testing before final release :)