I been seeing something that looks like IB timeout errors lately after upgrading to 1.6.5.1 using the supplied ofed kernel drivers. From what I can tell there hasnt been any real network issues that was apparent. Are these errors just typical if the network is busy? Heres from MDS: Sep 8 16:04:19 lustre-mds-0-0 kernel: LustreError: 10001:0:(events.c:55:request_out_callback()) @@@ ty pe 4, status -5 req at ffff81010ffb7200 x15310157/t0 o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a0 c1d08_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 2 fl Rpc:N/0/0 rc 0/0 Sep 8 16:04:19 lustre-mds-0-0 kernel: Lustre: Request x15310157 sent from lfs-MDT0000 to NID 10.12.29. 8 at o2ib 2s ago has timed out (limit 6s). Sep 8 16:04:23 lustre-mds-0-0 kernel: Lustre: Request x15310047 sent from lfs-MDT0000 to NID 10.12.28. 235 at o2ib 6s ago has timed out (limit 6s). Sep 8 16:04:43 lustre-mds-0-0 kernel: LustreError: 10003:0:(events.c:55:request_out_callback()) @@@ ty pe 4, status -5 req at ffff8100638cd400 x15310096/t0 o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a0 c1d0e_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 1 fl Complete:XN/0/0 rc 0/0 Sep 8 16:05:07 lustre-mds-0-0 kernel: LustreError: 3930:0:(events.c:55:request_out_callback()) @@@ typ e 4, status -113 req at ffff8101672a8c00 x15310047/t0 o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a 0c1ceb_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 1 fl Complete:XN/0/0 rc 0/0 Sep 8 16:08:44 lustre-mds-0-0 kernel: Lustre: Skipped 1 previous similar message Sep 8 16:13:24 lustre-mds-0-0 kernel: Lustre: Skipped 4 previous similar messages On the OSS: Sep 9 00:24:55 lustre-oss-4-1 kernel: Lustre: Skipped 3 previous similar messages Sep 9 00:25:01 lustre-oss-4-1 kernel: Lustre: Request x784766 sent from lfs-OST0039 to NID 10.12.29.7@ o2ib 20s ago has timed out (limit 20s). Sep 9 00:25:31 lustre-oss-4-1 kernel: LustreError: 13228:0:(o2iblnd_cb.c:2874:kiblnd_check_conns()) Ti med out RDMA with 10.12.29.7 at o2ib Sep 9 00:25:31 lustre-oss-4-1 kernel: LustreError: 13228:0:(events.c:55:request_out_callback()) @@@ ty pe 4, status -103 req at ffff8102d9b49c00 x784766/t0 o104->@:15/16 lens 232/256 e 0 to 20 dl 1220887501 r ef 1 fl Complete:XN/0/0 rc 0/0 -Alex
Alex Lee wrote:> I been seeing something that looks like IB timeout errors lately after > upgrading to 1.6.5.1 using the supplied ofed kernel drivers. > > From what I can tell there hasnt been any real network issues that was > apparent. Are these errors just typical if the network is busy?Could be. If you are having actual IB issues, there is generally an error from IB prior to any Lustre errors. (And generally lots of LustreErrors - we really get unhappy when your network breaks) What we see here are two requests timing out, one with a 6sec limit, one 20 sec. No indication of a related error, just a timeout exceeded. These timeouts can happen on a busy network. If they are frequent, you should increase obd_timeout. cliffw> > > Heres from MDS: > Sep 8 16:04:19 lustre-mds-0-0 kernel: LustreError: > 10001:0:(events.c:55:request_out_callback()) @@@ ty > pe 4, status -5 req at ffff81010ffb7200 x15310157/t0 > o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a0 > c1d08_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 2 fl Rpc:N/0/0 > rc 0/0 > Sep 8 16:04:19 lustre-mds-0-0 kernel: Lustre: Request x15310157 sent > from lfs-MDT0000 to NID 10.12.29. > 8 at o2ib 2s ago has timed out (limit 6s). > Sep 8 16:04:23 lustre-mds-0-0 kernel: Lustre: Request x15310047 sent > from lfs-MDT0000 to NID 10.12.28. > 235 at o2ib 6s ago has timed out (limit 6s). > Sep 8 16:04:43 lustre-mds-0-0 kernel: LustreError: > 10003:0:(events.c:55:request_out_callback()) @@@ ty > pe 4, status -5 req at ffff8100638cd400 x15310096/t0 > o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a0 > c1d0e_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 1 fl > Complete:XN/0/0 rc 0/0 > Sep 8 16:05:07 lustre-mds-0-0 kernel: LustreError: > 3930:0:(events.c:55:request_out_callback()) @@@ typ > e 4, status -113 req at ffff8101672a8c00 x15310047/t0 > o104-><F0>''<84><88><FF><FF><FF><FF>^P at NET_0x500000a > 0c1ceb_UUID:15/16 lens 232/256 e 0 to 6 dl 1220857463 ref 1 fl > Complete:XN/0/0 rc 0/0 > Sep 8 16:08:44 lustre-mds-0-0 kernel: Lustre: Skipped 1 previous > similar message > Sep 8 16:13:24 lustre-mds-0-0 kernel: Lustre: Skipped 4 previous > similar messages > > On the OSS: > Sep 9 00:24:55 lustre-oss-4-1 kernel: Lustre: Skipped 3 previous > similar messages > Sep 9 00:25:01 lustre-oss-4-1 kernel: Lustre: Request x784766 sent from > lfs-OST0039 to NID 10.12.29.7@ > o2ib 20s ago has timed out (limit 20s). > Sep 9 00:25:31 lustre-oss-4-1 kernel: LustreError: > 13228:0:(o2iblnd_cb.c:2874:kiblnd_check_conns()) Ti > med out RDMA with 10.12.29.7 at o2ib > Sep 9 00:25:31 lustre-oss-4-1 kernel: LustreError: > 13228:0:(events.c:55:request_out_callback()) @@@ ty > pe 4, status -103 req at ffff8102d9b49c00 x784766/t0 o104->@:15/16 lens > 232/256 e 0 to 20 dl 1220887501 r > ef 1 fl Complete:XN/0/0 rc 0/0 > > -Alex > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
> These timeouts can happen on a busy network. If they are frequent, you > should increase obd_timeout. > cliffw > >obd_timeout is already set at 300secs which I think is pretty long... Does the error message means it timed out after using up all 300sec waittime and then additional 20secs? Or is this from the new variable timeout that was implemented in 1.6.5? -Alex
Alex Lee wrote:>> These timeouts can happen on a busy network. If they are frequent, you >> should increase obd_timeout. >> cliffw >> >> > > obd_timeout is already set at 300secs which I think is pretty long... > Does the error message means it timed out after using up all 300sec > waittime and then additional 20secs? >No actually those timeouts i think are a divisor of obd_timeout. If you are already at 300 seconds you must have a very busy network.> Or is this from the new variable timeout that was implemented in 1.6.5?Adaptive timeouts are disabled by default currently, so you would have to enable them. Might be worth trying actually. cliffw> > -Alex > _______________________________________________ > Lustre-discuss mailing list > Lustre-discuss at lists.lustre.org > http://lists.lustre.org/mailman/listinfo/lustre-discuss
On Tue, Sep 09, 2008 at 01:55:46PM +0900, Alex Lee wrote:> I been seeing something that looks like IB timeout errors lately after > upgrading to 1.6.5.1 using the supplied ofed kernel drivers. > ...... > Sep 9 00:25:31 lustre-oss-4-1 kernel: LustreError: > 13228:0:(o2iblnd_cb.c:2874:kiblnd_check_conns()) Ti > med out RDMA with 10.12.29.7 at o2ibThere''s a debug patch in bug 15804 to help narrow down RDMA timeout causes: https://bugzilla.lustre.org/show_bug.cgi?id=15804 Thanks, Isaac