On Tue, Jan 10, 2017 at 2:38 AM Daniel Genis <daniel.genis at gmx.de>
wrote:
> Hello everyone,
>
> we're trying to tackle a rare bug that is very hard to debug.
>
> Our 10.3-RELEASE servers can panic, reboot, and subsequently come up
> without network (2x no carrier). We've seen this on 10.3-RELEASE-p0; we
> had never seen it before.
>
> root@storage ~ # pciconf -lv | grep -B3 network
> ix0@pci0:2:0:0: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
> vendor = 'Intel Corporation'
> device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
> class = network
> --
> ix1@pci0:2:0:1: class=0x020000 card=0xd10f19e5 chip=0x10fb8086 rev=0x01 hdr=0x00
> vendor = 'Intel Corporation'
> device = '82599ES 10-Gigabit SFI/SFP+ Network Connection'
> class = network
>
> Our network is configured as active/passive using lagg. (/etc/rc.conf):
>
> ifconfig_ix0="up"
> ifconfig_ix1="up"
> cloned_interfaces="lagg0"
> ifconfig_lagg0="laggproto failover laggport ix0 laggport ix1 10.1.2.31/16"
>
> After a panic boot the network interfaces show up like this:
>
> ix0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
> ether 60:08:10:d0:4e:9f
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> media: <unknown type> (autoselect)
> status: no carrier
> ix1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
> options=8407bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,VLAN_HWTSO>
> ether 60:08:10:d0:4e:9f
> nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
> media: <unknown type> (autoselect)
> status: no carrier
>
> The network switch sees the connection as online. The LEDs of the NICs
> suggest the same: they are lit as in normal operation. Unplugging and
> replugging the network cable gets the network online. Shutting the port
> on the switch and re-enabling it also gets the network online. However,
> another reboot returns the machine to the no-carrier state.
>
> I've built various kernels trying to find where the regression is,
> without success. I tried porting the 10.2 NIC driver (2.8.3) to 10.3 and
> subsequently the lagg code as well. I ported NIC driver 3.1.14 from
> pfSense into 10.3-STABLE (2 December kernel) to no avail; porting the
> lagg code from 10.2 did not make any difference either. After rebooting
> with those kernels the server remains in the no-carrier state.
>
> We install our systems using mfsbsd for PXE boot. If I boot a machine
> which has the "no carrier" state using the 10.3 PXE boot, both NICs
> come online. If I then boot from disk again, the machine returns to the
> "no carrier" state.
>
> Recently we got some new machines, so we can shuffle more around and
> also try to debug this. We base-installed one using mfsbsd 10.3 PXE and
> configured it as always. Interestingly, here one of the two NICs
> entered the "no carrier" state while the other NIC remained operational.
> This persisted across reboots.
>
> The issue disappears after many reboots, but it's not conclusive. I've
> had 2 machines with which I could experiment.
>
> On one, the problem disappeared on its own after a reboot (kernel
> 10.3-STABLE git hash d99ba5c aka r299900(?)).
>
> On the other one I PXE-booted a 10.1 live environment and then
> booted into kernel 10.3-STABLE (git hash 3673260fc9 aka r308456(?)).
> But I don't think anything can be concluded from that. That was the
> machine which had both NICs online after booting into the 10.3 PXE
> environment but subsequently returned to the no-carrier state when
> booting 10.3 from disk.
>
> We also tried many sysctl flags (and many reboots), but without success.
> For example: hw.ix.enable_msix=0 and hw.ix.enable_msi=0
>
> At the moment I have no spare/empty machine in this state. We will,
> though, empty one machine which is currently in this state (but is in
> production right now).
> I don't know how to trigger this state manually, which doesn't help
> with debugging.
>
> I could link references where others report similar issues, for example
> https://www.reddit.com/r/PFSENSE/comments/45bcuq/10_gig_woes/
> There they suggest that the new Intel NIC driver 3.1.14 fixes it, though
> I was not able to resolve the state by booting into a kernel with this
> driver.
>
> If I can provide any additional information please do not hesitate to ask.
>
> Any tips and suggestions for debugging are most welcome!
>
> With kind regards,
>
> Daniel
> _______________________________________________
> freebsd-stable at freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-stable
> To unsubscribe, send any mail to "freebsd-stable-unsubscribe at freebsd.org"
>
This is a late follow-up, but could you file this as a bug on
bugs.freebsd.org?
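
For the report it may help to record which ports are stuck after each boot. A minimal sketch, assuming FreeBSD-style ifconfig output like the one quoted above; the function name no_carrier_ifaces is made up for illustration and is not part of any tool:

```shell
# Print the names of interfaces whose status line reads "no carrier".
# Reads ifconfig-style output on stdin (hypothetical helper, a sketch only).
no_carrier_ifaces() {
    awk '
        # An interface stanza starts in column 1, e.g. "ix0: flags=..."
        /^[A-Za-z0-9_.]+: / { iface = $1; sub(/:$/, "", iface) }
        # Detail lines such as "status:" are indented under the stanza
        /^[ \t]+status: no carrier/ { print iface }
    '
}
```

It could be run as `ifconfig | no_carrier_ifaces` right after boot, with the result saved alongside dmesg for the bug report.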
- Eric