thr3ads.net - Lustre discuss - [Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts] [Aug 2009]

Charles A. Taylor

2009-Aug-17 16:23 UTC

[Lustre-discuss] [Fwd: [ofa-general] IPoIB Transmit Timeouts]

FWIW, I posted this to ofa-general a little earlier.   Anyone else
seeing this?    Suggestions?    I think this is an OFED 1.4.1 problem
but they may point the finger at you guys.  :)

We've tried limiting OST threads to no avail.   It doesn't really seem
to require a heavy load to trigger it - more or less random.

Charlie Taylor
UF HPC Center

-------- Forwarded Message --------
From: Charles A. Taylor <taylor at hpc.ufl.edu>
To: general at lists.openfabrics.org
Cc: Craig Prescott <prescott at hpc.ufl.edu>
Subject: [ofa-general] IPoIB Transmit Timeouts
Date: Mon, 17 Aug 2009 12:10:25 -0400

We upgraded our file servers to OFED 1.4.1 last Thursday and have since
been hit with a daily ration of the following across all eight of our
servers...

Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647

The difference between the head/tail is always 123.   The send queue
size is 128 according to...

cat /sys/module/ib_ipoib/parameters/send_queue_size 
128

From the post below, others seem to have encountered this, but we have
not seen any patches or workarounds.   Has anyone solved this problem?
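
The head/tail arithmetic above can be checked mechanically. The following is only a sketch in Python: the regular expression is written against the "queue stopped" lines quoted in this thread, the function name is hypothetical, and treating tx_head/tx_tail as 32-bit wrapping counters is an assumption rather than something stated here.

```python
import re

# Pattern modelled on the "queue stopped" lines quoted in this thread.
# \s+ before tx_tail tolerates the line wrap that mail clients introduce.
QUEUE_RE = re.compile(
    r"(?P<ifname>ib\d+): queue stopped (?P<stopped>\d+), "
    r"tx_head (?P<head>\d+),\s+tx_tail (?P<tail>\d+)"
)


def outstanding_sends(line, counter_bits=32):
    """Return (ifname, stopped, tx_head - tx_tail), or None if no match.

    The difference is taken modulo 2**counter_bits so that a wrapped
    counter still gives a small positive result. The 32-bit width is an
    assumption, not taken from the thread.
    """
    m = QUEUE_RE.search(line)
    if m is None:
        return None
    diff = (int(m.group("head")) - int(m.group("tail"))) % (1 << counter_bits)
    return m.group("ifname"), int(m.group("stopped")), diff


if __name__ == "__main__":
    samples = [
        "Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, "
        "tx_head 868165770, tx_tail 868165647",
        "ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565",
    ]
    for s in samples:
        print(outstanding_sends(s))
    # prints ('ib1', 1, 123) and then ('ib0', 1, 126)
```

Both differences sit just under the send_queue_size of 128 reported above, in our log and in the one quoted from the earlier ofa-general post.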

The servers were very stable under OFED 1.2.   We are running the
Lustre-patched kernel, but we did that under OFED 1.2 + Lustre 1.6.4.2 as
well, and I'm pretty sure the Lustre patches don't touch the IB modules.

Relevant information:
=====================
CentOS 5.3
Lustre 1.8.0.1
2.6.18-128.1.6.el5_lustre.1.8.0.1smp
X86_64 (Opteron 275s)

hca_id: mthca0
        fw_ver:                         4.8.200
        node_guid:                      0005:ad00:0004:668c
        sys_image_guid:                 0002:c900:0100:d050
        vendor_id:                      0x02c9
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       MT_00A0000001
        phys_port_cnt:                  2
                port:   1
                        state:                  active (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               49
                        port_lmc:               0x00

                port:   2
                        state:                  active (4)
                        max_mtu:                2048 (4)
                        active_mtu:             2048 (4)
                        sm_lid:                 1
                        port_lid:               98
                        port_lmc:               0x00


Charlie Taylor
UF HPC Center

> On Wed, Jul 29, 2009 at 2:14 PM, Pradeep Satyanarayana <
> prade... at linux.vnet.ibm.com> wrote:
> 
> > Hal Rosenstock wrote:
> > > Hi,
> > >
> > > I'm seeing the following messages from IPoIB:
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > ib0: post_send failed
> > > NETDEV WATCHDOG: ib0: transmit timed out
> > > ib0: transmit timeout: latency 1374 msecs
> > > ib0: queue stopped 1, tx_head 140245691, tx_tail 140245565
> > >
> > > What are the possible (and most likely) causes of post_send failures? I
> > > went through the code for all the errors (some at the driver level) but
> > > none popped out at me.
> > >
> >
> > Is it possible that the receiver is overwhelmed and hence the tx_ring is
> > full?
> 

Nirmal Seenu

2009-Aug-17 18:18 UTC

I was getting these same errors when I was running the following kernel:

kernel-lustre-smp-2.6.18-92.1.17.el5_lustre.1.6.7.1.x86_64

These errors went away when I started using 2.6.22.19 with lustre 
patches + OFED-1.4.2 
(http://www.openfabrics.org/downloads/OFED/ofed-1.4.2/OFED-1.4.2.tgz) on 
the Lustre servers.

Nirmal

Isaac Huang

2009-Aug-17 22:36 UTC

On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
> FWIW, I posted this to ofa-general a little earlier.   Anyone else
> seeing this?    Suggestions?    I think this is an OFED 1.4.1 problem
> but they may point the finger at you guys.  :)
> 
> We've tried limiting OST threads to no avail.   It doesn't really seem
> to require a heavy load to trigger it - more or less random.

I wouldn't think it's directly caused by Lustre. The IPoIB interface
is only needed for address resolution - no Lustre traffic would end up
sitting in the IPoIB interface's TX queue.

Have you tried to stress IPoIB, without Lustre running, with a TCP/IP
benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?
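
As a rough illustration of the kind of test meant here, the sketch below only builds the command lines for a flood ping and a netperf TCP_STREAM run. It does not execute them. The function name, the default counts and durations, and the peer address 192.0.2.10 are all made up for the example. Flood ping normally needs root, and netperf needs netserver running on the peer.

```python
import shlex


def ipoib_stress_commands(peer, interface="ib1", ping_count=100000,
                          netperf_secs=60):
    """Build (but do not run) two commands that push traffic over IPoIB.

    peer is the IPoIB address of another host on the same subnet;
    interface is the local IPoIB interface to send from. All default
    values are illustrative assumptions.
    """
    # Flood ping: -f floods, -c limits the packet count, -I picks the interface.
    ping = ["ping", "-f", "-c", str(ping_count), "-I", interface, peer]
    # netperf: -H is the remote host, -t the test name, -l the length in seconds.
    netperf = ["netperf", "-H", peer, "-t", "TCP_STREAM", "-l", str(netperf_secs)]
    return [ping, netperf]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address; substitute a real IPoIB peer.
    for cmd in ipoib_stress_commands("192.0.2.10"):
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Watching the kernel log for NETDEV WATCHDOG messages while these run would show whether IPoIB alone can trigger the timeout.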

Isaac

> ......
> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
> 
> The difference between the head/tail is always 123.   The send queue
> size is 128 according to...
> 
> cat /sys/module/ib_ipoib/parameters/send_queue_size 
> 128

Craig Prescott

2009-Aug-18 01:16 UTC

Isaac Huang wrote:
> On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
>> FWIW, I posted this to ofa-general a little earlier.   Anyone else
>> seeing this?    Suggestions?    I think this is an OFED 1.4.1 problem
>> but they may point the finger at you guys.  :)
>>
>> We've tried limiting OST threads to no avail.   It doesn't really seem
>> to require a heavy load to trigger it - more or less random.
> 
> I wouldn't think it's directly caused by Lustre. The IPoIB interface
> is only needed for address resolution - no Lustre traffic would end up
> sitting in the IPoIB interface's TX queue.

We are using a tcp NID on the (troubled) ib1 interfaces to reach our 
non-IB hosts.

We have o2ib NIDs on ib0 (dual-port HCA) to reach the 
InfiniBand-connected hosts on the same subnet.  No problems there.

> Have you tried to stress IPoIB, without Lustre running, with a TCP/IP 
> benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?

We've tried to stress IPoIB with netperf TCP_STREAM on a spare OSS (same
hardware, same connectivity) running the same Lustre kernel.  No trouble
so far.

Cheers,
Craig Prescott
UF HPC Center

> Isaac
> 
>> ......
>> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
>> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
>> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
>>
>> The difference between the head/tail is always 123.   The send queue
>> size is 128 according to...
>>
>> cat /sys/module/ib_ipoib/parameters/send_queue_size 
>> 128
