thr3ads.net - freebsd stable - igb(4) watchdog timeout, lagg(4) fails [Feb 2015]

Jack Vogel

2015-Feb-11 17:31 UTC

igb(4) watchdog timeout, lagg(4) fails

tdh and tdt mean the head and tail indices of the ring, and these values are
obviously severely borked :)
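
To make the head/tail terminology concrete, here is a minimal Python sketch of a descriptor ring. It is a toy model, not the igb driver's code: the class TxRing and its method names are made up for illustration, and the ring size of 4096 is taken from the hw.igb.txd setting quoted in this thread. The point is that both indices must always stay within 0..size-1, which is why values in the billions cannot be real ring positions.

```python
class TxRing:
    """Toy model of a transmit descriptor ring (hypothetical, not driver code)."""

    def __init__(self, size):
        self.size = size
        self.head = 0  # next descriptor the hardware will process (TDH)
        self.tail = 0  # next descriptor the driver will fill (TDT)

    def pending(self):
        # Descriptors handed to the hardware but not yet processed.
        return (self.tail - self.head) % self.size

    def free(self):
        # One slot stays unused so that head == tail always means "empty".
        return self.size - 1 - self.pending()

    def driver_enqueue(self, count):
        # The driver fills descriptors and advances the tail index.
        if count > self.free():
            raise RuntimeError("no descriptors available")
        self.tail = (self.tail + count) % self.size

    def hardware_complete(self, count):
        # The hardware processes descriptors and advances the head index.
        count = min(count, self.pending())
        self.head = (self.head + count) % self.size
        return count


ring = TxRing(4096)
ring.driver_enqueue(10)
ring.hardware_complete(4)
print(ring.head, ring.tail, ring.pending(), ring.free())
```

With a ring of 4096 entries this prints `4 10 6 4089`. Both indices wrap modulo the ring size, so neither can ever reach 4096.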

I'm buried with some other issues, but I'll try and find some time to look
at this a bit more.

Jack


On Wed, Feb 11, 2015 at 8:55 AM, Harald Schmalzbauer <
h.schmalzbauer at omnilan.de> wrote:
> Regarding Harald Schmalzbauer's message of 10.01.2015 11:51 (localtime):
> > Regarding Jack Vogel's message of 09.01.2015 18:46 (localtime):
> >> The tuneable interrupt rate code is not mine, and looking at it I'm not
> >> entirely sure it works. Why are you focused on the interrupt rate anyway,
> >> do you have some reason to tie it to the watchdog?
> >>
> >> You could turn AIM off (enable_aim) and see if that changed anything?
> >>
> >> It seems most of the time problems show up they involve the use of lagg,
> >> if you take it out of the mix does the problem go away?
> ...
>
> > Is there a way to reset the interface without rebooting the machine? The
> > watchdog doesn't really reset the device; it's in a non-operating state
> > afterwards. I need to 'ifconfig down' it to bring lagg(4) back into an
> > operational state.
> > Some kind of D3/D0-state switch for a single address? kldunloading would
> > destroy the remaining interface too...
>
> I could isolate the igb watchdog timeout problem a bit.
> It only happens on nics which take the PCH-PCIe route. Nics that are
> connected to the CPU's PCIe root complex never show this issue.
>
> Currently, I suffer from one unresponsive nic which shows the following
> states:
> dev.igb.1.%desc: Intel(R) PRO/1000 Network Connection version - 2.4.0
> dev.igb.1.%driver: igb
> dev.igb.1.%location: slot=0 function=0 handle=\_SB_.PCI0.PE70.S1F0
> dev.igb.1.%pnpinfo: vendor=0x8086 device=0x10c9 subvendor=0x8086 subdevice=0xa03c class=0x020000
> dev.igb.1.%parent: pci11
> dev.igb.1.nvm: -1
> dev.igb.1.enable_aim: 1
> dev.igb.1.fc: 3
> dev.igb.1.rx_processing_limit: 250
> dev.igb.1.link_irq: 848
> ^^^^^^^^^^^^^^ 848???
> dev.igb.1.dropped: 0
> dev.igb.1.tx_dma_fail: 0
> dev.igb.1.rx_overruns: 0
> dev.igb.1.watchdog_timeouts: 414
> dev.igb.1.device_control: 1488978497
> dev.igb.1.rx_control: 67272738
> dev.igb.1.interrupt_mask: 4
> dev.igb.1.extended_int_mask: 2147483655
> dev.igb.1.tx_buf_alloc: 0
> dev.igb.1.rx_buf_alloc: 0
> dev.igb.1.fc_high_water: 47488
> dev.igb.1.fc_low_water: 47472
> dev.igb.1.queue0.interrupt_rate: 0
> dev.igb.1.queue0.txd_head: 0
> dev.igb.1.queue0.txd_tail: 0
> dev.igb.1.queue0.no_desc_avail: 2520
> dev.igb.1.queue0.tx_packets: 43894
> dev.igb.1.queue0.rxd_head: 0
> dev.igb.1.queue0.rxd_tail: 0
> dev.igb.1.queue0.rx_packets: 1918054
> dev.igb.1.queue0.rx_bytes: 0
> dev.igb.1.queue0.lro_queued: 0
> dev.igb.1.queue0.lro_flushed: 0
> dev.igb.1.queue1.interrupt_rate: 0
> dev.igb.1.queue1.txd_head: 0
> dev.igb.1.queue1.txd_tail: 0
> dev.igb.1.queue1.no_desc_avail: 17
> dev.igb.1.queue1.tx_packets: 36813
> dev.igb.1.queue1.rxd_head: 0
> dev.igb.1.queue1.rxd_tail: 0
> dev.igb.1.queue1.rx_packets: 63738
> dev.igb.1.queue1.rx_bytes: 0
> dev.igb.1.queue1.lro_queued: 0
> dev.igb.1.queue1.lro_flushed: 0
> ...
> dev.igb.1.interrupts.asserts: 5890499
> dev.igb.1.interrupts.rx_pkt_timer: 2103707
> dev.igb.1.interrupts.rx_abs_timer: 0
> dev.igb.1.interrupts.tx_pkt_timer: 0
> dev.igb.1.interrupts.tx_abs_timer: 1983448
> dev.igb.1.interrupts.tx_queue_empty: 50493
> dev.igb.1.interrupts.tx_queue_min_thresh: 0
> dev.igb.1.interrupts.rx_desc_min_thresh: 0
> dev.igb.1.interrupts.rx_overrun: 0
>
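
The register sysctls above are printed in decimal, which hides the bit layout. A small Python sketch can convert them to hex and list the set bit positions. Mapping those bits to names still requires the controller datasheet or the driver's register definitions, which this sketch deliberately does not attempt. The helper name set_bits is illustrative only.

```python
def set_bits(value):
    """Return the positions of all set bits in a 32-bit register value."""
    return [bit for bit in range(32) if (value >> bit) & 1]


# Decimal values copied from the sysctl output in this thread.
registers = {
    "device_control": 1488978497,
    "rx_control": 67272738,
    "interrupt_mask": 4,
    "extended_int_mask": 2147483655,
}

for name, value in registers.items():
    print(f"{name}: 0x{value:08x} bits={set_bits(value)}")
```
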
> The dev.igb.1.link_irq value doesn't really make sense, does it?
>
> The rest isn't unusual; it just shows the watchdog timeout problem, because
> of dev.igb.1.queue0.no_desc_avail I guess.
>
> I manually adjusted:
> hw.igb.num_queues: 2
> hw.igb.rx_process_limit: 250
> hw.igb.rxd: 4096
> hw.igb.txd: 4096
>
> As mentioned, the nics not going through PCH-PCIe don't show this
> problem (verified).
>
> This is the regular timeout interval for the last 24h (~3 minutes):
> Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
> Feb 11 17:26:53 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
> Feb 11 17:26:53 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
> Feb 11 17:26:53 vega kernel: igb1: link state changed to DOWN
> Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
> Feb 11 17:26:56 vega devd: Executing '/etc/rc.d/dhclient quietstart igb1'
> Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
> Feb 11 17:30:10 vega kernel: igb1: Queue(911600000) tdh = 2068077355, hw tdt = 397078446
> Feb 11 17:30:10 vega kernel: igb1: TX(911600000) desc avail = 0,Next TX to Clean = 0
> Feb 11 17:30:10 vega kernel: igb1: link state changed to DOWN
> Feb 11 17:30:13 vega kernel: igb1: link state changed to UP
>
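
The interval between resets can be read off the timestamps. A rough Python sketch follows, assuming syslog's year-less timestamp format and supplying 2015 as the year. The log excerpt is shortened to the relevant lines from above, and the function names are hypothetical.

```python
from datetime import datetime

# Shortened excerpt of the kernel log quoted in this thread.
LOG = """\
Feb 11 17:26:53 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:26:56 vega kernel: igb1: link state changed to UP
Feb 11 17:30:10 vega kernel: igb1: Watchdog timeout -- resetting
Feb 11 17:30:13 vega kernel: igb1: link state changed to UP
"""


def watchdog_times(text, year=2015):
    """Collect the timestamps of all watchdog timeout lines."""
    times = []
    for line in text.splitlines():
        if "Watchdog timeout" in line:
            # The first three fields are month, day and time of day.
            stamp = " ".join(line.split()[:3])
            times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return times


def intervals_seconds(times):
    """Seconds between each pair of consecutive timestamps."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


print(intervals_seconds(watchdog_times(LOG)))
```

For the two resets shown this gives 197 seconds, consistent with the ~3 minutes mentioned above.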
> But these resets don't bring the interface back into a working state :-(
> The "Queue" value remains constant, but "tdh" and "tdt" vary from time to
> time, for example:
> igb1: Queue(911600000) tdh = -337225283, hw tdt = 398180458
>
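
A quick way to see that these values cannot be ring indices is to compare them with the configured ring size (hw.igb.txd: 4096, from above) and to reinterpret the negative number as an unsigned 32-bit value. That the negative number comes from a 32-bit register value printed with a signed format is an assumption, not something confirmed in this thread. The helper names in this Python sketch are made up.

```python
RING_SIZE = 4096  # hw.igb.txd as set in this thread


def as_u32(value):
    """Reinterpret a possibly negative number as an unsigned 32-bit value."""
    return value & 0xFFFFFFFF


def is_valid_index(value, ring_size=RING_SIZE):
    """A ring index must lie in the range 0..ring_size-1."""
    return 0 <= value < ring_size


# Values copied from the watchdog messages quoted above.
logged = {"tdh": 2068077355, "tdt": 397078446, "tdh (later)": -337225283}

for name, value in logged.items():
    raw = as_u32(value)
    print(f"{name}: raw=0x{raw:08x} valid_index={is_valid_index(value)}")
```

None of the three logged values passes the range check, whichever way the sign is interpreted.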
> Unfortunately I don't know what they stand for. Is there an explanation
> for people who can't just look it up in the driver's code?
> Any idea where the problem could be?
>
> Thanks,
>
> -Harry
>
>

Harald Schmalzbauer

2015-Feb-11 19:48 UTC

igb(4) watchdog timeout, lagg(4) fails

Regarding Jack Vogel's message of 11.02.2015 18:31 (localtime):
> tdh and tdt mean the head and tail indices of the ring, and these values are
> obviously severely borked :)
>
> I'm buried with some other issues, but I'll try and find some time to look
> at this a bit more.
Highly appreciated, thanks in advance!

For the record: Rebooting the machine (ESXi guest only!) brought the
stalled igb1 back into operation.
The guest has 2 igb (kawela) ports, one from a NIC (Intel ET Dual Port
82576) on CPU-PCIe and the second port from an identical NIC, but connected
via PCH-PCIe.
The watchdog timeout problem only occurs with the port from the
PCH-PCIe-connected NIC (verified)!
After the reboot the suspicious "dev.igb.1.link_irq=848" turned into:
dev.igb.0.link_irq: 3
dev.igb.1.link_irq: 4

Thanks,

-Harry
