Abe Ingersoll
2009-Aug-27 01:52 UTC
[Lustre-discuss] lustre errors when system stressed; bad hardware?
Is this likely bad IB hardware/switch/cables? Bad RAM? It's lustre-1.8.1 on CentOS-5.x with four OSSs exporting two DataDirect/FC OSTs each, a separate MDT/MGS, and clients, all over QLogic ib_ipath IB. The clients are simultaneously running iozone against the lustre fs:

iozone -a -g 32G -f /mnt/sparta/iozone/iozone.`hostname`.file -M -R -b report-`hostname`.xls -i 0 -i 1 -i 2 -i 3 -i 4 -i 5 -i 6 -n 4g -y 4096

One client and an OSS spit out the errors below; iozone appears to continue on just fine.

client:

Lustre: 2637:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked()) Closing conn to 10.168.22.106@o2ib: error 0(waiting)
LustreError: 3898:0:(events.c:66:request_out_callback()) @@@ type 4, status -103 req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OST0003_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 3 fl Rpc:/0/0 rc 0/0
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289524 sent from sparta-OST0003-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 0s ago has failed due to network error (limit 7s). req@ffff81021933d400 x1312117388289524/t0 o4->sparta-OST0003_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335107 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection to service sparta-OST0003 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.168.22.106@o2ib. The ost_connect operation failed with -16
LustreError: Skipped 2 previous similar messages
Lustre: 3926:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1312117388289487 sent from sparta-OST0002-osc-ffff8102198f1000 to NID 10.168.22.106@o2ib 7s ago has timed out (limit 7s). req@ffff8102192b6400 x1312117388289487/t0 o4->sparta-OST0002_UUID@10.168.22.106@o2ib:6/4 lens 448/608 e 0 to 1 dl 1251335106 ref 2 fl Rpc:/0/0 rc 0/0
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection to service sparta-OST0002 via nid 10.168.22.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Lustre: sparta-OST0002-osc-ffff8102198f1000: Connection restored to service sparta-OST0002 using nid 10.168.22.106@o2ib.
Lustre: 3928:0:(import.c:508:import_select_connection()) sparta-OST0003-osc-ffff8102198f1000: tried all connections, increasing latency to 6s
Lustre: 3928:0:(import.c:508:import_select_connection()) Skipped 2 previous similar messages
Lustre: sparta-OST0003-osc-ffff8102198f1000: Connection restored to service sparta-OST0003 using nid 10.168.22.106@o2ib.
oss:

Lustre: 4790:0:(o2iblnd_cb.c:955:kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1 waiting 1: failed 12
Lustre: 4790:0:(o2iblnd_cb.c:1785:kiblnd_close_conn_locked()) Closing conn to 10.168.22.104@o2ib: error -5(waiting)
LustreError: 4790:0:(events.c:367:server_bulk_callback()) event type 4, status -5, desc ffff8100ae208000
LustreError: 4790:0:(events.c:367:server_bulk_callback()) event type 2, status -5, desc ffff8100ae208000
LustreError: 5086:0:(ost_handler.c:1014:ost_brw_write()) @@@ network error on bulk GET 0(1048576) req@ffff8101fc9fc800 x1312117388289524/t0 o4->2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID:0/0 lens 448/416 e 0 to 0 dl 1251335106 ref 1 fl Interpret:/0/0 rc 0/0
Lustre: 5086:0:(ost_handler.c:1150:ost_brw_write()) sparta-OST0003: ignoring bulk IO comm error with 2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID id 12345-10.168.22.104@o2ib - client will retry
Lustre: 4953:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0003: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
Lustre: 4953:0:(ldlm_lib.c:835:target_handle_connect()) sparta-OST0003: refuse reconnection from 2920ef40-0b97-644f-178a-5e74613e467b@10.168.22.104@o2ib to 0xffff810421231000; still busy with 2 active RPCs
LustreError: 4953:0:(ldlm_lib.c:1850:target_send_reply_msg()) @@@ processing error (-16) req@ffff8103baae8c00 x1312117388289527/t0 o8->2920ef40-0b97-644f-178a-5e74613e467b@NET_0x500000aa81668_UUID:0/0 lens 368/264 e 0 to 0 dl 1251335200 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 4953:0:(ldlm_lib.c:1850:target_send_reply_msg()) Skipped 1 previous similar message
Lustre: 5075:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0002: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
Lustre: 5010:0:(ldlm_lib.c:541:target_handle_reconnect()) sparta-OST0003: 2920ef40-0b97-644f-178a-5e74613e467b reconnecting
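For what it's worth, my rough plan for spot-checking the links on the client and the OSS that logged these errors is below -- just the stock infiniband-diags tools from OFED, so the exact tool names and flags may differ on other OFED releases. Does this look like the right place to start?

# Local HCA state, rate and LID (run on both the client and the OSS).
ibstat

# Link state, width and speed of every link visible from this node;
# a port stuck at 1x/SDR, or one that keeps flapping, tends to show up here.
iblinkinfo

# Per-port error counters (symbol errors, link downed, rcv errors, ...).
perfquery

# Reset the counters, rerun the iozone load, then query again so the
# numbers reflect only the stressed period.
perfquery -R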
Isaac Huang
2009-Aug-27 21:37 UTC
[Lustre-discuss] lustre errors when system stressed; bad hardware?
On Wed, Aug 26, 2009 at 06:52:24PM -0700, Abe Ingersoll wrote:
> ......
> kiblnd_tx_complete()) Tx -> 10.168.22.104@o2ib cookie 0xc8dd6 sending 1
> waiting 1: failed 12

12 == IB_WC_RETRY_EXC_ERR, which usually indicates faulty links in the network, or some other application (like an MPI application) hogging network resources unfavorably against Lustre. We once observed such errors at times when there was no IO at all - a bad MPI implementation was resending so aggressively upon RNR that even the tiny bit of keepalive traffic from Lustre would end up with IB_WC_RETRY_EXC_ERR.

Diagnostics from OFED and the fabric should point you to faulty hardware, and setting up IB QoS should prevent Lustre from being hurt badly by someone else. Meanwhile, there's a potential workaround mentioned here:

https://bugzilla.lustre.org/show_bug.cgi?id=14223#c36

But it's certainly not a good solution in the long run.

Thanks,
Isaac
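P.S. By "diagnostics from OFED" I mean roughly the following; the exact tool names and options depend on which OFED release you have installed:

# Fabric-wide sweep from any node that can reach the subnet manager:
# reports bad links, speed/width mismatches and excessive error counters.
ibdiagnet

# Summarize ports across the fabric whose error counters exceed thresholds.
ibqueryerrors

# Clear the port counters fabric-wide, rerun the iozone load for a while,
# then re-check -- counters that climb only under load usually point at
# marginal cables or ports.
ibclearerrors
ibclearcounters
ibqueryerrors

IB QoS itself is configured in the subnet manager (with OpenSM, QoS is enabled via opensm -Q plus a QoS policy file), but the details depend heavily on which SM you run, so I won't guess at a configuration here.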