Charles A. Taylor
2009-Aug-17 16:23 UTC
[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]
FWIW, I posted this to ofa-general a little earlier. Anyone else seeing this? Suggestions? I think this is an OFED 1.4.1 problem, but they may point the finger at you guys. :)

We've tried limiting OST threads, to no avail. It doesn't seem to require a heavy load to trigger it; it is more or less random.

Charlie Taylor
UF HPC Center

-------- Forwarded Message --------
From: Charles A. Taylor <taylor at hpc.ufl.edu>
To: general at lists.openfabrics.org
Cc: Craig Prescott <prescott at hpc.ufl.edu>
Subject: [ofa-general] IPoIB Transmit Timeouts
Date: Mon, 17 Aug 2009 12:10:25 -0400

We upgraded our file servers to OFED 1.4.1 last Thursday and have since been hit with a daily ration of the following across all eight of our servers...

Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647

The difference between the head/tail is always 123. The send queue size is 128 according to...

cat /sys/module/ib_ipoib/parameters/send_queue_size
128

From the post below, others seem to have encountered this, but we have not seen any patches or workarounds. Has anyone solved this problem?

The servers were very stable under OFED 1.2. We are running the Lustre-patched kernel, but we did that under OFED 1.2 + Lustre 1.6.4.2 as well, and I'm pretty sure the Lustre patches don't touch the IB modules.
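The head/tail arithmetic in the report above can be checked mechanically. The following is a minimal sketch, not part of the original post: the function name `outstanding_sends` is hypothetical, the default queue size of 128 is the value the report reads from `send_queue_size`, and the modulo step assumes the counters are unsigned 32-bit values that may wrap.

```python
import re

# Matches the IPoIB "queue stopped" line as it appears in the quoted syslog output.
QUEUE_RE = re.compile(
    r"(?P<ifname>ib\d+): queue stopped (?P<stopped>\d+), "
    r"tx_head (?P<head>\d+), tx_tail (?P<tail>\d+)"
)


def outstanding_sends(line, queue_size=128):
    """Return (interface, head - tail, remaining slots), or None if the line does not match.

    queue_size defaults to 128, the send_queue_size value quoted in the report.
    """
    m = QUEUE_RE.search(line)
    if m is None:
        return None
    # Assumption: the counters are unsigned 32-bit and may wrap, so subtract modulo 2**32.
    in_flight = (int(m["head"]) - int(m["tail"])) % 2**32
    return m["ifname"], in_flight, queue_size - in_flight


# Log line copied from the report above.
line = (
    "Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, "
    "tx_head 868165770, tx_tail 868165647"
)
result = outstanding_sends(line)
print(result)  # ('ib1', 123, 5)
```

For the line in the report this gives 123 outstanding sends, five short of the configured queue size of 128.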
Relevant information:
====================
CentOS 5.3
Lustre 1.8.0.1
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
X86_64 (Opteron 275s)

hca_id: mthca0
    fw_ver:          4.8.200
    node_guid:       0005:ad00:0004:668c
    sys_image_guid:  0002:c900:0100:d050
    vendor_id:       0x02c9
    vendor_part_id:  25208
    hw_ver:          0xA0
    board_id:        MT_00A0000001
    phys_port_cnt:   2
        port: 1
            state:       active (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      1
            port_lid:    49
            port_lmc:    0x00
        port: 2
            state:       active (4)
            max_mtu:     2048 (4)
            active_mtu:  2048 (4)
            sm_lid:      1
            port_lid:    98
            port_lmc:    0x00

Charlie Taylor
UF HPC Center

> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana <prade... at linux.vnet.ibm.com> wrote:
>
> > > Hal Rosenstock wrote:
> > > Hi,
> > >
> > > I'm seeing the following messages from IPoIB:
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > NETDEV WATCHDOG: ib0: transmit timed out
> > > ib0: transmit timeout: latency 1374 msecs
> > > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
> > >
> > > What are the possible (and most likely) causes of post_send failures? I
> > > went through the code for all the errors (some at the driver level) but
> > > none popped out at me.
> > >
> >
> > Is it possible that the receiver is overwhelmed and hence the tx_ring is
> > full?
Nirmal Seenu
2009-Aug-17 18:18 UTC
[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]
I was getting these same errors when I was running the following kernel:

kernel-lustre-smp-2.6.18-92.1.17.el5_lustre.1.6.7.1.x86_64

These errors went away when I started using 2.6.22.19 with Lustre patches + OFED-1.4.2 (http://www.openfabrics.org/downloads/OFED/ofed-1.4.2/OFED-1.4.2.tgz) on the Lustre servers.

Nirmal
Isaac Huang
2009-Aug-17 22:36 UTC
[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]
On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
> FWIW, I posted this to ofa-general a little earlier. Anyone else
> seeing this? Suggestions? I think this is an OFED 1.4.1 problem
> but they may point the finger at you guys. :)
>
> We've tried limiting OST threads to no avail. It doesn't really seem
> to require a heavy load to trigger it - more or less random.

I wouldn't think it's directly caused by Lustre. The IPoIB interface is only needed for address resolution - no Lustre traffic would end up sitting in the IPoIB interface's TX queue.

Have you tried to stress IPoIB, without Lustre running, with a TCP/IP benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?

Isaac

> ......
> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
>
> The difference between the head/tail is always 123. The send queue
> size is 128 according to...
>
> cat /sys/module/ib_ipoib/parameters/send_queue_size
> 128
Craig Prescott
2009-Aug-18 01:16 UTC
[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]
Isaac Huang wrote:
> On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
>> FWIW, I posted this to ofa-general a little earlier. Anyone else
>> seeing this? Suggestions? I think this is an OFED 1.4.1 problem
>> but they may point the finger at you guys. :)
>>
>> We've tried limiting OST threads to no avail. It doesn't really seem
>> to require a heavy load to trigger it - more or less random.
>
> I wouldn't think it's directly caused by Lustre. The IPoIB interface
> is only needed for address resolution - no Lustre traffic would end up
> sitting in the IPoIB interface's TX queue.

We are using a tcp NID on the (troubled) ib1 interfaces to reach our non-IB hosts. We have o2ib NIDs on ib0 (dual-port HCA) to reach the InfiniBand-connected hosts on the same subnet. No problems there.

> Have you tried to stress IPoIB, without Lustre running, with a TCP/IP
> benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?

We've tried to stress IPoIB with netperf TCP_STREAM on a spare OSS (same hardware, same connectivity) running the same Lustre kernel. No trouble so far.

Cheers,
Craig Prescott
UF HPC Center

> Isaac
>
>> ......
>> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
>> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
>> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
>>
>> The difference between the head/tail is always 123. The send queue
>> size is 128 according to...
>>
>> cat /sys/module/ib_ipoib/parameters/send_queue_size
>> 128
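One way to confirm "no trouble so far" after a stress run, or to see which servers and interfaces are affected, is to count watchdog events in syslog per host and interface. The sketch below is illustrative only: the function name `count_watchdog_events` is hypothetical, the regular expression assumes the classic syslog line layout seen in the messages quoted in this thread, and the sample lines are copied from the original report.

```python
import re
from collections import Counter

# Assumes the syslog layout "Mon DD HH:MM:SS host kernel: ..." seen in this thread.
WATCHDOG_RE = re.compile(
    r"^\w{3}\s+\d+ [\d:]+ (?P<host>\S+) kernel: "
    r"NETDEV WATCHDOG: (?P<ifname>\w+): transmit timed out"
)


def count_watchdog_events(lines):
    """Count NETDEV WATCHDOG transmit timeouts per (host, interface)."""
    counts = Counter()
    for line in lines:
        m = WATCHDOG_RE.match(line)
        if m:
            counts[(m["host"], m["ifname"])] += 1
    return counts


# Sample lines copied from the original report.
sample = [
    "Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out",
    "Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs",
    "Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647",
]
counts = count_watchdog_events(sample)
for (host, ifname), n in sorted(counts.items()):
    print(f"{host} {ifname}: {n} timeout(s)")
```

To run it against a real log, pass an open file object instead of `sample`; the log location (for example `/var/log/messages`) is an assumption about the local syslog configuration. An empty result on the spare OSS after the netperf run would match what Craig reports.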