Hi, dear list,
We have been seeing frequent login-node crashes these days. The symptom is that
we cannot access the nodes remotely, but we can still access them from the
local terminal. We run the command:
ps -ef | grep ldlm
root 3823 1 0 10:57 ? 00:00:00 [ldlm_bl_00]
root 3824 1 0 10:57 ? 00:00:00 [ldlm_bl_01]
root 3825 1 0 10:57 ? 00:00:00 [ldlm_bl_02]
root 3826 1 0 10:57 ? 00:00:00 [ldlm_bl_03]
root 3827 1 0 10:57 ? 00:00:00 [ldlm_bl_04]
root 3828 1 0 10:57 ? 00:00:00 [ldlm_bl_05]
root 3829 1 0 10:57 ? 00:00:00 [ldlm_bl_06]
root 3830 1 0 10:57 ? 00:00:00 [ldlm_bl_07]
root 3831 1 0 10:57 ? 00:00:00 [ldlm_cn_00]
root 3832 1 0 10:57 ? 00:00:00 [ldlm_cn_01]
root 3834 1 0 10:57 ? 00:00:00 [ldlm_cn_02]
root 3835 1 0 10:57 ? 00:00:00 [ldlm_cn_03]
root 3836 1 0 10:57 ? 00:00:00 [ldlm_cn_04]
root 3837 1 0 10:57 ? 00:00:00 [ldlm_cn_05]
root 3838 1 0 10:57 ? 00:00:00 [ldlm_cn_06]
root 3839 1 0 10:57 ? 00:00:00 [ldlm_cn_07]
root 3840 1 0 10:57 ? 00:00:00 [ldlm_cb_00]
root 3841 1 0 10:57 ? 00:00:00 [ldlm_cb_01]
root 3842 1 0 10:57 ? 00:00:00 [ldlm_cb_02]
root 3843 1 0 10:57 ? 00:00:00 [ldlm_cb_03]
root 3844 1 0 10:57 ? 00:00:00 [ldlm_cb_04]
root 3845 1 0 10:57 ? 00:00:00 [ldlm_cb_05]
root 3846 1 0 10:57 ? 00:00:00 [ldlm_cb_06]
root 3847 1 0 10:57 ? 00:00:00 [ldlm_cb_07]
...
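To get a quick overview, we tally the ldlm kernel threads by type and check the load average with a one-liner like the following (just a generic diagnostic sketch of our own, not a Lustre tool):

```shell
# Tally Lustre DLM kernel threads by type (bl = blocking callback,
# cn = cancel, cb = completion callback) and show the load average.
ps -e -o comm= | grep '^ldlm_' \
  | awk -F_ '{c[$2]++; n++} END {for (t in c) print t, c[t]; print "total", n}'
cat /proc/loadavg
```

On our nodes the total runs into the hundreds.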
As you can see, there are many ldlm processes, up to several hundred. As a
result, the load average is too high (155.0 165.0 145.0) for the node to work
normally. We have no idea what is wrong and have to restart the nodes. Around
the same time, we captured the kernel logs shown below.
Filesystem versions:
Server: Lustre 1.6.6
Client: Lustre 1.6.5
Has anyone else run into the same problem? We would appreciate any help!
Log excerpt:
Nov 6 09:24:10 lxslc22 kernel: LustreError:
30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC:
canceling anyway
Nov 6 09:24:10 lxslc22 kernel: LustreError:
30586:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar
messages
Nov 6 09:24:10 lxslc22 kernel: LustreError:
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov 6 09:24:10 lxslc22 kernel: LustreError:
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov 6 09:24:10 lxslc22 kernel: LustreError:
30586:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar
messages
Nov 6 09:24:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov 6 09:24:49 lxslc22 last message repeated 7 times
Nov 6 09:25:04 lxslc22 last message repeated 3 times
Nov 6 09:25:08 lxslc22 kernel: Lustre: Request x1111842 sent from
MGC192.168.50.32@tcp to NID 192.168.50.32@tcp 500s ago has timed out
(limit 500s).
Nov 6 09:25:08 lxslc22 kernel: Lustre: Skipped 29 previous similar messages
Nov 6 09:25:09 lxslc22 hm[4390]: Server went down, finding new server.
Nov 6 09:25:44 lxslc22 last message repeated 7 times
Nov 6 09:26:49 lxslc22 last message repeated 13 times
Nov 6 09:27:09 lxslc22 last message repeated 4 times
Nov 6 09:27:13 lxslc22 kernel: Lustre:
3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200:
tried all connections, increasing latency to 51s
Nov 6 09:27:13 lxslc22 kernel: Lustre:
3728:0:(import.c:395:import_select_connection()) Skipped 17 previous similar
messages
Nov 6 09:27:14 lxslc22 hm[4390]: Server went down, finding new server.
Nov 6 09:27:49 lxslc22 last message repeated 7 times
Nov 6 09:28:54 lxslc22 last message repeated 13 times
Nov 6 09:29:59 lxslc22 last message repeated 13 times
Nov 6 09:31:04 lxslc22 last message repeated 13 times
Nov 6 09:32:09 lxslc22 last message repeated 13 times
Nov 6 09:33:14 lxslc22 last message repeated 13 times
Nov 6 09:34:19 lxslc22 last message repeated 13 times
Nov 6 09:34:54 lxslc22 last message repeated 7 times
Nov 6 09:34:55 lxslc22 kernel: LustreError:
30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC:
canceling anyway
Nov 6 09:34:55 lxslc22 kernel: LustreError:
30521:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 10 previous similar
messages
Nov 6 09:34:55 lxslc22 kernel: LustreError:
30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov 6 09:34:55 lxslc22 kernel: LustreError:
30521:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 10 previous similar
messages
Nov 6 10:02:38 lxslc22 kernel: Lustre: Request x1113069 sent from
besfs-OST0008-osc-f7e14200 to NID 192.168.50.40@tcp 500s ago has timed out
(limit 500s).
Nov 6 10:02:38 lxslc22 kernel: Lustre: Skipped 18 previous similar messages
Nov 6 10:02:39 lxslc22 hm[4390]: Server went down, finding new server.
Nov 6 10:03:14 lxslc22 last message repeated 7 times
Nov 6 10:04:19 lxslc22 last message repeated 13 times
Nov 6 10:04:39 lxslc22 last message repeated 4 times
Nov 6 10:04:43 lxslc22 kernel: Lustre:
3728:0:(import.c:395:import_select_connection()) besfs-MDT0000-mdc-f7e14200:
tried all connections, increasing latency to 51s
Nov 6 10:04:43 lxslc22 kernel: Lustre:
3728:0:(import.c:395:import_select_connection()) Skipped 34 previous similar
messages
Nov 6 10:04:44 lxslc22 hm[4390]: Server went down, finding new server.
Nov 6 10:05:19 lxslc22 last message repeated 7 times
Nov 6 10:06:24 lxslc22 last message repeated 13 times
Nov 6 10:06:44 lxslc22 last message repeated 4 times
Nov 6 10:06:46 lxslc22 kernel: LustreError:
30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC:
canceling anyway
Nov 6 10:06:46 lxslc22 kernel: LustreError:
30582:0:(ldlm_request.c:986:ldlm_cli_cancel_req()) Skipped 9 previous similar
messages
Nov 6 10:06:46 lxslc22 kernel: LustreError:
30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Nov 6 10:06:46 lxslc22 kernel: LustreError:
30582:0:(ldlm_request.c:1575:ldlm_cli_cancel_list()) Skipped 9 previous similar
messages
Thanks,
Sarea
2009-11-06
huangql