I recently put a new server running 9.2 (with local patches for NFS)
into production, and it's immediately started to fail in an odd way.
Since I pounded this server pretty heavily and never saw the error in
testing, I'm more than a little bit taken aback. We have identical
hardware in production with 9.1, and I have the same kernel running
just peachy on a machine with Chelsio T4 NICs. The problem machine has
ixgbe(4):

ix0: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15> port 0x9c00-0x9c1f mem 0xdef80000-0xdeffffff,0xdef7c000-0xdef7ffff irq 24 at device 0.0 on pci2
ix0: Using MSIX interrupts with 7 vectors
ix0: Ethernet address: 04:7d:7b:a5:87:32
ix0: PCI Express Bus: Speed 5.0GT/s Width x4
ix1: <Intel(R) PRO/10GbE PCI-Express Network Driver, Version - 2.5.15> port 0x9880-0x989f mem 0xdee80000-0xdeefffff,0xdee7c000-0xdee7ffff irq 34 at device 0.1 on pci2
ix1: Using MSIX interrupts with 7 vectors
ix1: Ethernet address: 04:7d:7b:a5:87:33
ix1: PCI Express Bus: Speed 5.0GT/s Width x4

(pciconf tells me these are "82599EB 10-Gigabit SFI/SFP+ Network
Connection". It's a bug that the driver doesn't tell me that.)

These are glued together in a lagg(4) using LACP.

Since we put this server into production, random network system calls
have started failing with [EFBIG] or maybe sometimes [EIO]. I've
observed this with a simple ping, but various daemons also log the
errors:

Mar 20 09:22:04 nfs-prod-4 sshd[42487]: fatal: Write failed: File too large [preauth]
Mar 20 09:23:44 nfs-prod-4 nrpe[42492]: Error: Could not complete SSL handshake. 5

The machine eventually becomes unreachable and has to be rebooted from
the console.

So, can anyone tell me how this is possible, and what changed between
9.1 and 9.2 to cause it?

-GAWollman
turn off TSO

the problems sound similar to the one I reported a while back.
turning off tso fixed it.

danny

On Mar 20, 2014, at 3:26 PM, Garrett Wollman <wollman at bimajority.org> wrote:

> [...]
> Since we put this server into production, random network system calls
> have started failing with [EFBIG] or maybe sometimes [EIO]. I've
> observed this with a simple ping, but various daemons also log the
> errors:
> [...]
> So, can anyone tell me how this is possible, and what changed between
> 9.1 and 9.2 to cause it?
>
> -GAWollman
In article <21290.60558.750106.630804 at hergotha.csail.mit.edu>, I wrote:

>Since we put this server into production, random network system calls
>have started failing with [EFBIG] or maybe sometimes [EIO]. I've
>observed this with a simple ping, but various daemons also log the
>errors:
>Mar 20 09:22:04 nfs-prod-4 sshd[42487]: fatal: Write failed: File too
>large [preauth]
>Mar 20 09:23:44 nfs-prod-4 nrpe[42492]: Error: Could not complete SSL
>handshake. 5

I found at least one call stack where this happens and it does get
returned all the way to userspace:

 17  15547 _bus_dmamap_load_buffer:return
              kernel`_bus_dmamap_load_mbuf_sg+0x5f
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   8863 _bus_dmamap_load_mbuf_sg:return
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  25315 bus_dmamap_load_mbuf_sg:return
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  15547 _bus_dmamap_load_buffer:return
              kernel`_bus_dmamap_load_mbuf_sg+0x5f
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   8863 _bus_dmamap_load_mbuf_sg:return
              kernel`bus_dmamap_load_mbuf_sg+0x38
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  25315 bus_dmamap_load_mbuf_sg:return
              kernel`ixgbe_xmit+0xcf
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   4206 ixgbe_xmit:return
              kernel`ixgbe_mq_start_locked+0x94
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   4208 ixgbe_mq_start_locked:return
              kernel`ixgbe_mq_start+0x12a
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   4212 ixgbe_mq_start:return
              if_lagg.ko`lagg_transmit+0xc4
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  36017 lagg_transmit:return
              kernel`ether_output_frame+0x33
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  23948 ether_output_frame:return
              kernel`ether_output+0x4fe
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  18849 ether_output:return
              kernel`ip_output+0xd74
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  30895 ip_output:return
              kernel`tcp_output+0xfea
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  20356 tcp_output:return
              kernel`tcp_usr_send+0x325
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  10923 tcp_usr_send:return
              kernel`sosend_generic+0x3f6
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  19509 sosend_generic:return
              kernel`soo_write+0x5e
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  26794 soo_write:return
              kernel`dofilewrite+0x85
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17   9141 dofilewrite:return
              kernel`kern_writev+0x6c
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  25665 kern_writev:return
              kernel`sys_write+0x64
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

 17  24390 sys_write:return
              kernel`amd64_syscall+0x5ea
              kernel`0xffffffff808443c7

The MTU here is 9120, and the ixgbe driver has one local modification,
to prevent it from using large contiguous mbufs in its receive queue:

Index: ixgbe.c
===================================================================
--- ixgbe.c	(revision 261091)
+++ ixgbe.c	(working copy)
@@ -1117,12 +1117,8 @@
 	*/
 	if (adapter->max_frame_size <= 2048)
 		adapter->rx_mbuf_sz = MCLBYTES;
-	else if (adapter->max_frame_size <= 4096)
+	else
 		adapter->rx_mbuf_sz = MJUMPAGESIZE;
-	else if (adapter->max_frame_size <= 9216)
-		adapter->rx_mbuf_sz = MJUM9BYTES;
-	else
-		adapter->rx_mbuf_sz = MJUM16BYTES;
 
 	/* Prepare receive descriptors and buffers */
 	if (ixgbe_setup_receive_structures(adapter)) {

-GAWollman
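
For anyone reading the trace above: bus_dma(9) documents EFBIG as the
error returned when a mapping needs more segments than the DMA tag
allows, and "File too large" is the message text for EFBIG. What follows
is a minimal userland sketch of that kind of check, not code from the
driver. The function names, the "one header buffer plus fixed-size
payload buffers" layout, and any segment limit passed in are
illustrative assumptions; the real bus_dma code may coalesce or split
segments.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model, not driver code: estimate how many DMA segments
 * a packet needs when its payload sits in fixed-size buffers behind one
 * separate header buffer, counting one segment per buffer.
 */
static int
segments_needed(size_t payload_len, size_t buf_size)
{
	assert(buf_size > 0);
	/* One segment for the headers, then ceil(payload_len / buf_size). */
	return 1 + (int)((payload_len + buf_size - 1) / buf_size);
}

/*
 * Mimic the documented bus_dma(9) result: 0 on success, EFBIG when the
 * mapping would need more segments than the tag allows (max_segs, an
 * assumed value supplied by the caller).
 */
static int
model_dmamap_load(size_t payload_len, size_t buf_size, int max_segs)
{
	return (segments_needed(payload_len, buf_size) > max_segs) ? EFBIG : 0;
}
```

Under these assumptions and an assumed limit of 32 segments, a 9120-byte
frame held in 2048-byte buffers needs 6 segments and loads fine, while a
65535-byte send handed down in one piece needs 33 and comes back as
EFBIG. That would fit the suggestion to turn off TSO, but nothing in the
trace itself confirms that this is what happened here.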