faheem patel
2011-Sep-19 09:10 UTC
[Lustre-discuss] Slow access or hang state of lustre filesystem on client machines
Hi All,

Thanks in advance. We are new to the Lustre filesystem. We have installed a Lustre filesystem of about 30TB, which is mounted on our client systems.

We have 2 MDC servers (MDS) in HA mode with an IB interface, and 1 /mdc filesystem of 500GB mounted. We have 2 OSS servers with HA configured between them. The first OSS server has 8 OSTs and the second OSS server has 9 OSTs, i.e. a total of 17 OSTs distributed between the 2 OSS servers, which are configured in HA with a bonded IB interface on both servers. All Lustre clients and the OSS and MDS servers are on the IB (InfiniBand) network.

We have been getting the following error messages on our OSS servers and also on our client machine for the past week.

----------------------------------------------------------------------------------------------------------------------------------------
*Lustre Server OSS error logs*

Sep 19 11:08:57 oss1 kernel: Lustre: 15007:0:(ldlm_lib.c:872:target_handle_connect()) lustre-OST0006: refuse reconnection from 65820f02-c4f0-e79a-4778-15a9b4653a88@10.148.0.2@o2ib to 0xffff8806320e1800; still busy with 1 active RPCs
Sep 19 11:08:57 oss1 kernel: Lustre: 15007:0:(ldlm_lib.c:872:target_handle_connect()) Skipped 1 previous similar message
Sep 19 11:09:18 oss1 kernel: Lustre: 13143:0:(ldlm_lib.c:572:target_handle_reconnect()) lustre-OST0006: 65820f02-c4f0-e79a-4778-15a9b4653a88 reconnecting
Sep 19 11:09:18 oss1 kernel: Lustre: 13143:0:(ldlm_lib.c:572:target_handle_reconnect()) Skipped 18 previous similar messages
Sep 19 11:09:26 oss1 kernel: Lustre: 13255:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348987462916 sent from lustre-OST0008 to NID 10.148.0.2@o2ib 7s ago has timed out (7s prior to deadline).
Sep 19 11:09:26 oss1 kernel: req@ffff880631af7400 x1380348987462916/t0 o104->@NET_0x500000a940002_UUID:15/16 lens 296/384 e 0 to 1 dl 1316410765 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError: 138-a: lustre-OST0008: A client on nid 10.148.0.2@o2ib was evicted due to a lock blocking callback to 10.148.0.2@o2ib timed out: rc -107
Sep 19 11:09:26 oss1 kernel: LustreError: 13122:0:(client.c:841:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff880312465c00 x1380348987462920/t0 o105->@NET_0x500000a940002_UUID:15/16 lens 344/384 e 0 to 1 dl 0 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 19 11:09:26 oss1 kernel: LustreError: 13122:0:(ldlm_lockd.c:595:ldlm_handle_ast_error()) ### client (nid 10.148.0.2@o2ib) returned 0 from completion AST ns: filter-lustre-OST0008_UUID lock: ffff880629826c00/0x7f1137a31caded22 lrc: 3/0,0 mode: PW/PW res: 10165838/0 rrc: 3 type: EXT [0->18446744073709551615] (req 0->18446744073709551615) flags: 0x0 remote: 0xb4433ce500b30ffb expref: 440 pid: 13255 timeout 0
Sep 19 11:09:38 oss1 kernel: LustreError: 13117:0:(ldlm_lockd.c:1824:ldlm_cancel_handler()) operation 103 from 12345-10.148.0.2@o2ib with bad export cookie 9156160691120629456
Sep 19 11:26:58 oss1 gdm-session-worker[15599]: PAM pam_putenv: NULL pam handle passed

------------------------------------------------------------------------------------------------------
*Lustre Client error log messages*

Sep 19 10:38:43 service0 kernel: [ 6094.583298] req@ffff880b0f7f0800 x1380348451643117/t0 o8->lustre-OST0007_UUID@10.148.0.107@o2ib:28/4 lens 368/584 e 0 to 1 dl 1316408923 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:38:43 service0 kernel: [ 6094.583305] Lustre: 8565:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 6 previous similar messages
Sep 19 10:38:44 service0 kernel: [ 6095.582711] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0001-osc-ffff8806234ee000: tried all connections, increasing latency to 8s
Sep 19 10:38:46 service0 kernel: [ 6097.355378] LustreError: 11-0: an error occurred while communicating with 10.148.0.107@o2ib. The ost_connect operation failed with -16
Sep 19 10:38:46 service0 kernel: [ 6097.355381] LustreError: Skipped 20 previous similar messages
Sep 19 10:38:58 service0 kernel: [ 6109.582174] Lustre: lustre-OST0001-osc-ffff8806234ee000: Connection restored to service lustre-OST0001 using nid 10.148.0.107@o2ib.
Sep 19 10:39:55 service0 kernel: [ 6166.617376] Lustre: 14902:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645667 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 14s ago has timed out (14s prior to deadline).
Sep 19 10:39:55 service0 kernel: [ 6166.617381] req@ffff8805f69bc800 x1380348451645667/t0 o101->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4 lens 296/544 e 0 to 1 dl 1316408995 ref 2 fl Rpc:/0/0 rc 0/0
Sep 19 10:39:55 service0 kernel: [ 6166.617390] Lustre: 14902:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 8 previous similar messages
Sep 19 10:39:55 service0 kernel: [ 6166.617402] Lustre: lustre-OST0000-osc-ffff8806234ee000: Connection to service lustre-OST0000 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:39:55 service0 kernel: [ 6166.617406] Lustre: Skipped 8 previous similar messages
Sep 19 10:39:56 service0 sshd[14904]: Accepted keyboard-interactive/pam for root from 192.9.70.32 port 33623 ssh2
Sep 19 10:40:08 service0 kernel: [ 6179.616393] Lustre: 8565:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645687 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 13s ago has timed out (13s prior to deadline).
Sep 19 10:40:08 service0 kernel: [ 6179.616396] req@ffff880af48e5000 x1380348451645687/t0 o8->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4 lens 368/584 e 0 to 1 dl 1316409008 ref 2 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:09 service0 kernel: [ 6180.616338] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing latency to 9s
Sep 19 10:40:09 service0 kernel: [ 6180.616344] Lustre: 8566:0:(import.c:517:import_select_connection()) Skipped 8 previous similar messages
Sep 19 10:40:16 service0 kernel: [ 6187.219814] Lustre: 8564:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1380348451645670 sent from lustre-OST0000-osc-ffff8806234ee000 to NID 10.148.0.106@o2ib 31s ago has timed out (31s prior to deadline).
Sep 19 10:40:16 service0 kernel: [ 6187.219818] req@ffff880b27740800 x1380348451645670/t0 o400->lustre-OST0000_UUID@10.148.0.106@o2ib:28/4 lens 192/384 e 0 to 1 dl 1316409016 ref 1 fl Rpc:N/0/0 rc 0/0
Sep 19 10:40:16 service0 kernel: [ 6187.219843] Lustre: lustre-OST0003-osc-ffff8806234ee000: Connection to service lustre-OST0003 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219448] Lustre: lustre-OST0010-osc-ffff8806234ee000: Connection to service lustre-OST0010 via nid 10.148.0.106@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Sep 19 10:40:21 service0 kernel: [ 6192.219453] Lustre: Skipped 2 previous similar messages
Sep 19 10:40:24 service0 kernel: [ 6195.615180] Lustre: 8566:0:(import.c:517:import_select_connection()) lustre-OST0000-osc-ffff8806234ee000: tried all connections, increasing latency to 10s
Sep 19 10:40:25 service0 kernel: [ 6196.029170] LustreError: 11-0: an error occurred while communicating with 10.148.0.106@o2ib. The ost_connect operation failed with -16
Sep 19 10:40:25 service0 kernel: [ 6196.029174] LustreError: Skipped 8 previous similar messages
Sep 19 10:40:25 service0 kernel: [ 6196.029345] Lustre: lustre-OST0000-osc-ffff8806234ee000: Connection restored to service lustre-OST0000 using nid 10.148.0.106@o2ib.
Sep 19 10:40:25 service0 kernel: [ 6196.029349] Lustre: Skipped 8 previous similar messages

Thanks and Regards,
Faheem Patel
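
For anyone triaging logs like the ones above, here is a minimal sketch of how the return codes and lost connections could be tallied per target. It is not part of Lustre or any official tool: the names `summarize`, `ERRNO_NAMES`, `RC_PATTERN` and `LOST_PATTERN` are made up for illustration, and the regular expressions only cover the message shapes that appear in this post. On Linux, -16 is EBUSY and -107 is ENOTCONN.

```python
import re
from collections import Counter

# Linux errno values for the two return codes that appear in this post.
ERRNO_NAMES = {16: "EBUSY", 107: "ENOTCONN"}

# Matches e.g. "operation failed with -16" and "timed out: rc -107".
RC_PATTERN = re.compile(r"(?:failed with|timed out: rc) -(\d+)")

# Matches e.g. "Connection to service lustre-OST0000 via nid <nid> was lost".
# The NID part is matched loosely so that both "10.148.0.106@o2ib" and an
# archive-mangled "10.148.0.106 at o2ib" are accepted.
LOST_PATTERN = re.compile(r"Connection to service (\S+) via nid .+? was lost")


def summarize(lines):
    """Count error codes and lost connections per target in log lines."""
    codes = Counter()
    lost = Counter()
    for line in lines:
        for match in RC_PATTERN.finditer(line):
            code = int(match.group(1))
            # Fall back to the raw number for codes not in the small table.
            codes[ERRNO_NAMES.get(code, str(code))] += 1
        for match in LOST_PATTERN.finditer(line):
            lost[match.group(1)] += 1
    return codes, lost


# Three message bodies taken from the logs in this post (timestamps removed).
sample = [
    "LustreError: 11-0: an error occurred while communicating with "
    "10.148.0.107@o2ib. The ost_connect operation failed with -16",
    "LustreError: 138-a: lustre-OST0008: A client on nid 10.148.0.2@o2ib "
    "was evicted due to a lock blocking callback to 10.148.0.2@o2ib "
    "timed out: rc -107",
    "Lustre: lustre-OST0000-osc-ffff8806234ee000: Connection to service "
    "lustre-OST0000 via nid 10.148.0.106@o2ib was lost; in progress "
    "operations using this service will wait for recovery to complete.",
]

codes, lost = summarize(sample)
print(dict(codes))
print(dict(lost))
```

To run it against a real log, pass an open file instead of the sample list, e.g. `summarize(open("/var/log/messages"))`; the path is an assumption and depends on the syslog configuration of the machine.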