Arun Khan
2010-Feb-08 18:43 UTC
[CentOS] Experiencing continual eth0 link up/down on a 10G Chelsio NIC (cxgb3 driver)
File Server OS: CentOS 5.3 (x86_64)
Kernel: CentOS Plus kernel (need XFS fs drivers)

The file server has a Chelsio T310 10GBASE-CX4 RNIC (rev 3) PCI Express x8 MSI-X (eth0); the driver and firmware are stock from the CentOS Plus kernel.

Using ethtool I have verified driver association with the 3 NICs on the system (eth1 and eth2 are not connected to any switch):

Driver for eth0
driver: cxgb3
version: 1.1.3-ko
firmware-version: T 7.4.0 TP 1.1.0

Driver for eth1
driver: e1000e
version: 1.0.2-k2
firmware-version: 1.0-0

Driver for eth2
driver: e1000e
version: 1.0.2-k2
firmware-version: 1.0-0

For the last 3-4 weeks, I have noticed that the eth0 link keeps going up and down, confirmed by "dmesg" output as well as in /var/log/messages (dmesg sample shown below).

eth0: link down
eth0: link up, 10Gbps, full-duplex
eth0: link down
eth0: link up, 10Gbps, full-duplex
eth0: link down
eth0: link up, 10Gbps, full-duplex

The kernel RPM verification shows no errors:

# uname --kernel-release
2.6.18-164.2.1.el5.plus

# rpm --verify kernel-2.6.18-164.2.1.el5.plus

The hardware vendor tells me that the card either fails completely (kaput) or works - there is no grey area. He is of the opinion that the problem is with the driver.

Verification of the kernel rpm tells me that all files, including the md5sum of the cxgb3 driver file, are OK.

I would like to hear from anyone with the same NIC, or another rev., using the same driver.
Are you seeing similar link up/down in your system?
How did you solve the problem?

TIA
-- Arun Khan
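For anyone wanting to repeat the driver check above on their own box, here is a minimal sketch. It assumes "ethtool -i" prints the driver:, version: and firmware-version: lines in the form shown in the post; the function name driver_summary is made up for this example.

```shell
#!/bin/sh
# Condense "ethtool -i IFACE" output (read from stdin) into one line.
# driver_summary is a hypothetical helper name, not part of ethtool.
driver_summary() {
    awk -F': ' '
        $1 == "driver"           { drv = $2 }
        $1 == "version"          { ver = $2 }
        $1 == "firmware-version" { fw  = $2 }
        END { printf "%s %s (firmware %s)\n", drv, ver, fw }
    '
}

# Example use (needs ethtool and usually root):
# for i in eth0 eth1 eth2; do
#     printf '%s: ' "$i"
#     ethtool -i "$i" | driver_summary
# done
```

Keeping the parsing separate from the ethtool call means the same function can be fed saved output from another machine for comparison.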
Hakan Koseoglu
2010-Feb-08 18:55 UTC
[CentOS] Experiencing continual eth0 link up/down on a 10G Chelsio NIC (cxgb3 driver)
Hi Arun,

On Mon, Feb 8, 2010 at 6:43 PM, Arun Khan <knura9 at gmail.com> wrote:
> The file server has a Chelsio T310 10GBASE-CX4 RNIC (rev 3) PCI
> Express x8 MSI-X (eth0), driver and firmware is stock from the CentOS
> Plus kernel.

Way, way back, I had similar problems on a bunch of servers with 3Com cards and some 3Com switches. It turned out that the autonegotiation of the 3Com cards and switches we had at that time was buggy. Some idiot had set the switches to 100Mbit full duplex and the cards to autoneg, so the cards kept re-initiating autonegotiation and the buggy cards kept settling on the wrong rate. The result was poor performance and a constant link down/up cycle.

It's worth checking what the switch side is set to. Setting both sides to the same fixed value might also help, or remove all forced settings and leave both sides on autonegotiation.

> The hardware vendor tells me that the card either fails completely
> (kaput) or works - there is no grey area. He is of the opinion that
> the problem is with the driver.

One last thing: IMHO, never trust a supplier trying to wriggle out of a support case :) I've seen plenty of network cards that are on the way out but not dead yet.

-- 
Hakan (m1fcj) - http://www.hititgunesi.org
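To see what the NIC side has actually negotiated before comparing it with the switch, something like the following sketch may help. It assumes plain "ethtool IFACE" output with the usual Speed:, Duplex:, Auto-negotiation: and Link detected: lines; link_summary is a name invented here. Whether the cxgb3 driver accepts forced settings on a CX4 port is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Summarise the negotiated link state from "ethtool IFACE" output on stdin.
# link_summary is a hypothetical helper name.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/            { speed   = $2 }
        /^[ \t]*Duplex:/           { duplex  = $2 }
        /^[ \t]*Auto-negotiation:/ { autoneg = $2 }
        /^[ \t]*Link detected:/    { link    = $2 }
        END {
            printf "speed=%s duplex=%s autoneg=%s link=%s\n",
                   speed, duplex, autoneg, link
        }
    '
}

# Example use (needs ethtool and usually root):
# ethtool eth0 | link_summary
```

If the switch reports a different speed, duplex or autonegotiation state than this line shows, that mismatch is the first thing to chase.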
Jobst Schmalenbach
2010-Feb-08 22:18 UTC
[CentOS] Experiencing continual eth0 link up/down on a 10G Chelsio NIC (cxgb3 driver)
IMHO link failures are never ONLY a LOCAL problem; they involve BOTH sides. One of the people answering this question has already explained the (auto)negotiation issue.

Before that, check:
* the other side
* if the other side is a switch, use a different port
* the cable (make sure the connectors sit properly and are clean)
* lock the speed and duplex (e.g. 100mb full duplex) on both sides.

Jobst

On Tue, Feb 09, 2010 at 12:13:53AM +0530, Arun Khan (knura9 at gmail.com) wrote:
> The last 3-4 weeks, I have noticed that the eth0 link keeps going up
> and down, confirmed by "dmesg" output as well in /var/log/messages
> (dmesg sample shown below).
> [...]
> Are you seeing similar link up/down in your system?
> How did you solve the problem?

-- 
Jobst Schmalenbach, jobst at barrett.com.au, General Manager
Barrett Consulting Group P/L & The Meditation Room P/L
+61 3 9532 7677, POBox 277, Caulfield South, 3162, Australia
Arun Khan
2010-Feb-09 04:18 UTC
[CentOS] Experiencing continual eth0 link up/down on a 10G Chelsio NIC (cxgb3 driver)
Hi Jobst, Brent, and Hakan,

Thanks for your inputs. I sincerely appreciate your suggestions and your sharing your experience.

I have posted a support query with Chelsio as well.

In my case the connection is a fiber cable. On the switch side there is only one 10G port, so there is not much I can do there. Luckily the switch and the cable, as well as the NIC, were supplied by the same vendor.

I am going to request that the hardware vendor send their engineer to take on the next steps.

Will post final findings here when the problem is solved.

-- Arun Khan
Arun Khan
2010-Feb-11 18:15 UTC
[CentOS] Experiencing continual eth0 link up/down on a 10G Chelsio NIC (cxgb3 driver)
SOLVED

On Tue, Feb 9, 2010 at 9:48 AM, Arun Khan <knura9 at gmail.com> wrote:
> I am going to request the hardware vendor to send their engineer and
> to take on the next steps.

The hardware vendor finally sent their technical support team.

> Will post final findings over here when the problem is solved.

After investigation, they concluded that the card was overheating and that this was the possible cause of the link going up and down. They have changed the location of the server. Hopefully, the problem will not manifest again.

-- Arun Khan
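Since the conclusion rests on "hopefully", one rough way to confirm that the flapping has stopped after the move is to count link-down events per hour in the log. This is only a sketch: it assumes the traditional syslog line format ("Mon DD HH:MM:SS host ...") in /var/log/messages, and flaps_per_hour is a name made up for the example.

```shell
#!/bin/sh
# Count "<iface>: link down" messages per hour in a syslog-style file.
# Usage: flaps_per_hour IFACE LOGFILE
# flaps_per_hour is a hypothetical helper; the timestamp layout
# (month, day, HH:MM:SS as the first three fields) is an assumption.
flaps_per_hour() {
    iface=$1
    logfile=$2
    awk -v pat="${iface}: link down" '
        index($0, pat) { count[$1 " " $2 " " substr($3, 1, 2) ":00"]++ }
        END { for (k in count) print k, count[k] }
    ' "$logfile" | sort   # lexical sort: fine within one day, not across months
}

# Example use:
# flaps_per_hour eth0 /var/log/messages
```

An empty result for the days after the move would support the overheating diagnosis; clusters at particular hours would suggest the heat problem is still there.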