huangql
2009-Aug-17 07:23 UTC
[Lustre-discuss] Why are there many threads named "ll_imp_inval" in client?
Hi all,

Our system ran well for the past two weeks. However, we have found that some compute nodes have a large number of threads named "ll_imp_inval", and the load average on those clients (compute nodes) is as high as 28. As a result, users can't submit jobs to those clients.

I read the source file (import.c). My understanding is that the ll_imp_inval thread is started on each ptlrpc-connect-import or ptlrpc-import-recovery, so if something is wrong on the server or the client, the thread never exits. Is that right?

We ran 'ps -aux | grep ll_imp_inval', with the following results:

root 22568 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22569 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22570 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22571 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
...

We checked the log and found the main messages below. On other nodes we also see the client-evicted messages:

Aug 13 08:57:55 bws0211 kernel: Lustre: Request x5103879 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 08:57:55 bws0211 kernel: Lustre: Skipped 6 previous similar messages
Aug 13 08:58:34 bws0211 kernel: Lustre: testfs-OST0018-osc-f7dcfe00: Connection restored to service testfs-OST0018 using nid 192.168.50.79 at tcp.
Aug 13 09:00:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 09:00:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 5 previous similar messages
Aug 13 09:00:00 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:00:00 bws0211 kernel: LustreError: Skipped 2 previous similar messages
Aug 13 09:00:00 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 09:00:00 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 09:00:00 bws0211 kernel: Lustre: Skipped 2 previous similar messages
Aug 13 09:02:05 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:02:05 bws0211 kernel: LustreError: Skipped 1 previous similar message
Aug 13 09:06:15 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:06:15 bws0211 kernel: LustreError: Skipped 3 previous similar messages
Aug 13 09:10:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 26s
Aug 13 09:10:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 10 previous similar messages
Aug 13 09:12:30 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:12:30 bws0211 kernel: LustreError: Skipped 5 previous similar messages
Aug 13 09:20:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 09:20:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 09:22:55 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:22:55 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 09:31:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 09:31:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 09:33:20 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:33:20 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 09:41:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 09:41:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 09:43:45 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:43:45 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 09:52:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 09:52:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 09:54:10 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 09:54:10 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:02:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 10:02:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:04:35 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:04:35 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:12:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 10:12:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:15:00 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:15:00 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:23:20 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 10:23:20 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:25:25 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:25:25 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:33:45 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 10:33:45 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:35:50 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:35:50 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:44:10 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 10:44:10 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:46:15 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:46:15 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 10:54:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 10:54:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 10:56:40 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 10:56:40 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:05:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 11:05:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 11:07:05 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 11:07:05 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:15:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 11:15:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 11:17:30 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 11:17:30 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:25:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 11:25:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 11:27:55 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 11:27:55 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:36:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 11:36:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 11:38:20 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 11:38:20 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:46:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 11:46:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 11:48:45 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 11:48:45 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 11:57:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 11:57:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 12:01:15 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The obd_ping operation failed with -107
Aug 13 12:01:15 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 12:01:15 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:01:15 bws0211 kernel: Lustre: Skipped 2 previous similar messages
Aug 13 12:01:15 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 12:01:15 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 12:07:30 bws0211 kernel: Lustre: Request x5245865 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 12:07:30 bws0211 kernel: Lustre: Skipped 2 previous similar messages
Aug 13 12:09:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 12:09:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Aug 13 12:09:35 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:09:35 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 12:09:35 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 12:17:55 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The obd_ping operation failed with -107
Aug 13 12:17:55 bws0211 kernel: LustreError: Skipped 3 previous similar messages
Aug 13 12:17:55 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:17:55 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 12:17:55 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 12:20:00 bws0211 kernel: Lustre: Request x5254534 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 12:20:00 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 12:22:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 11s
Aug 13 12:22:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Aug 13 12:28:20 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 12:28:20 bws0211 kernel: LustreError: Skipped 6 previous similar messages
Aug 13 12:32:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 36s
Aug 13 12:32:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 12:38:45 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 12:38:45 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 12:42:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 12:42:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 12:49:10 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The obd_ping operation failed with -107
Aug 13 12:49:10 bws0211 kernel: LustreError: Skipped 5 previous similar messages
Aug 13 12:49:10 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:49:10 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 12:49:10 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 12:53:20 bws0211 kernel: Lustre: Request x5277671 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 12:53:20 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 12:55:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 12:55:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Aug 13 12:57:30 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 12:57:30 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001c; in progress operations using this service will fail.
Aug 13 12:57:30 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection restored to service testfs-OST001c using nid 192.168.50.80 at tcp.
Aug 13 13:03:45 bws0211 kernel: Lustre: Request x5283965 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 13:03:45 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 13:05:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 13:05:50 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 1 previous similar message
Aug 13 13:05:50 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:05:50 bws0211 kernel: LustreError: Skipped 1 previous similar message
Aug 13 13:16:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 26s
Aug 13 13:16:15 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 13:16:15 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:16:15 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 13:26:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 13:26:40 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 13:26:40 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:26:40 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 13:37:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 13:37:05 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 13:37:05 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:37:05 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 13:47:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 26s
Aug 13 13:47:30 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 13:47:30 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:47:30 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 13:57:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 13:57:55 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 13:57:55 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 13:57:55 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:08:20 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 14:08:20 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 14:08:20 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 14:08:20 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:18:45 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 14:18:45 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 14:18:45 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 14:18:45 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:29:10 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 51s
Aug 13 14:29:10 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 14:29:10 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 14:29:10 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:39:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 14:39:35 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 14:39:35 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 14:39:35 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:50:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 31s
Aug 13 14:50:00 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 9 previous similar messages
Aug 13 14:58:20 bws0211 kernel: Lustre: Request x5363068 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 14:58:20 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 14:58:20 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The obd_ping operation failed with -107
Aug 13 14:58:20 bws0211 kernel: LustreError: Skipped 9 previous similar messages
Aug 13 14:58:20 bws0211 kernel: Lustre: testfs-OST001c-osc-f7dcfe00: Connection to service testfs-OST001c via nid 192.168.50.80 at tcp was lost; in progress operations using this service will wait for recovery to complete.
Aug 13 14:58:22 bws0211 kernel: Lustre: Request x5363070 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 502s ago has timed out (limit 500s).
Aug 13 14:58:22 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:00:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 36s
Aug 13 15:00:25 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 1 previous similar message
Aug 13 15:00:27 bws0211 kernel: Lustre: Request x5364501 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 502s ago has timed out (limit 500s).
Aug 13 15:02:34 bws0211 kernel: Lustre: Request x5365911 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 504s ago has timed out (limit 500s).
Aug 13 15:04:35 bws0211 kernel: Lustre: Request x5367371 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:08:45 bws0211 kernel: Lustre: Request x5370730 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:10:51 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 41s
Aug 13 15:10:51 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 1 previous similar message
Aug 13 15:19:11 bws0211 kernel: Lustre: Request x5377804 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:19:11 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:21:16 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 46s
Aug 13 15:21:16 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 1 previous similar message
Aug 13 15:23:20 bws0211 kernel: Lustre: Request x5369047 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 1500s ago has timed out (limit 1500s).
Aug 13 15:23:20 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:29:36 bws0211 kernel: Lustre: Request x5384919 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:31:40 bws0211 kernel: Lustre: Request x5386289 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:31:40 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:31:41 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001c-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 15:31:41 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Aug 13 15:37:56 bws0211 kernel: Lustre: Request x5390694 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:40:01 bws0211 kernel: Lustre: Request x5393807 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:40:01 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:42:06 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001c-osc-f7dcfe00: tried all connections, increasing latency to 11s
Aug 13 15:42:06 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
Aug 13 15:46:16 bws0211 kernel: Lustre: Request x5412456 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:50:26 bws0211 kernel: Lustre: Request x5425369 sent from testfs-OST001c-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 15:50:26 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 15:52:31 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001c-osc-f7dcfe00: tried all connections, increasing latency to 16s
Aug 13 15:52:31 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
Aug 13 15:56:41 bws0211 kernel: Lustre: Request x5444459 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 16:02:56 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001c-osc-f7dcfe00: tried all connections, increasing latency to 21s
Aug 13 16:02:56 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 2 previous similar messages
Aug 13 16:05:01 bws0211 kernel: Lustre: Request x5469816 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 16:05:01 bws0211 kernel: Lustre: Skipped 2 previous similar messages
Aug 13 16:07:06 bws0211 kernel: LustreError: 4236:0:(import.c:756:ptlrpc_connect_interpret()) testfs-OST001a_UUID went back in time (transno 22127117 was previously committed, server now claims 22127112)! See https://bugzilla.clusterfs.com/long_list.cgi?buglist=9646
Aug 13 16:07:06 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 16:09:11 bws0211 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.50.80 at tcp. The ost_connect operation failed with -19
Aug 13 16:13:21 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001b-osc-f7dcfe00: tried all connections, increasing latency to 16s
Aug 13 16:13:21 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 4 previous similar messages
Aug 13 16:21:41 bws0211 kernel: Lustre: Request x5519899 sent from testfs-OST001b-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 16:21:41 bws0211 kernel: Lustre: Skipped 2 previous similar messages
Aug 13 16:23:46 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001b-osc-f7dcfe00: tried all connections, increasing latency to 21s
Aug 13 16:23:46 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 1 previous similar message
Aug 13 16:32:06 bws0211 kernel: Lustre: Request x5501056 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 1500s ago has timed out (limit 1500s).
Aug 13 16:32:06 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 16:40:26 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 6s
Aug 13 16:40:26 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 3 previous similar messages
Aug 13 16:48:46 bws0211 kernel: Lustre: Request x5599100 sent from testfs-OST001a-osc-f7dcfe00 to NID 192.168.50.80 at tcp 500s ago has timed out (limit 500s).
Aug 13 16:48:46 bws0211 kernel: Lustre: Skipped 5 previous similar messages
Aug 13 16:55:03 bws0211 kernel: Lustre: setting import testfs-OST001b_UUID INACTIVE by administrator request
Aug 13 16:55:03 bws0211 kernel: Lustre: Skipped 1 previous similar message
Aug 13 16:55:03 bws0211 kernel: Lustre: testfs-OST001b-osc-f7dcfe00.osc: set parameter active=0
Aug 13 16:55:03 bws0211 kernel: Lustre: Skipped 8 previous similar messages
Aug 13 16:57:06 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) testfs-OST001a-osc-f7dcfe00: tried all connections, increasing latency to 16s
Aug 13 16:57:06 bws0211 kernel: Lustre: 4237:0:(import.c:395:import_select_connection()) Skipped 5 previous similar messages
Aug 13 16:57:06 bws0211 kernel: LustreError: 167-0: This client was evicted by testfs-OST001a; in progress operations using this service will fail.

Thank you in advance for your help. I hope to hear from you soon.

Best wishes,
Sarea
2009-08-17
huangql
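A log of this length is easier to reason about once it is tallied. The following is a minimal Python sketch, not part of the original post, that counts evictions per target and failed operations per error code from syslog lines like the ones above. The function name `summarize` and the small errno table are assumptions made for illustration. The names given for 19 and 107 are the standard Linux errno names (ENODEV and ENOTCONN).

```python
import re
from collections import Counter

# Linux errno names for the two codes that appear in the log above.
# This table is illustrative and deliberately not exhaustive.
ERRNO_NAMES = {19: "ENODEV", 107: "ENOTCONN"}

# "This client was evicted by testfs-OST001c; ..."
EVICTED = re.compile(r"This client was evicted by (\S+);")
# "The ost_connect operation failed with -19"
FAILED = re.compile(r"The (\w+) operation failed with -(\d+)")


def summarize(lines):
    """Return (evictions per target, failures per (operation, errno name))."""
    evictions = Counter()
    failures = Counter()
    for line in lines:
        match = EVICTED.search(line)
        if match:
            evictions[match.group(1)] += 1
        match = FAILED.search(line)
        if match:
            code = int(match.group(2))
            name = ERRNO_NAMES.get(code, str(code))
            failures[(match.group(1), name)] += 1
    return evictions, failures
```

To use it, pass any iterable of log lines, for example an open file handle on the client's syslog file (the file location depends on the distribution). The "Skipped N previous similar messages" lines are not counted, so the totals are lower bounds.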
Alexey Lyashkov
2009-Aug-17 14:31 UTC
[Lustre-discuss] Why are there many threads named "ll_imp_inval" in client?
Hi huangql,

Which Lustre version are you using?

On Mon, 2009-08-17 at 15:23 +0800, huangql wrote:
> Hi, all
>
> Our system run well past two weeks, However, we found there are some
> computing nodes which has so many threads named "ll_imp_inval", and
> the load average of the clients(computing nodes) is up to 28. As a
> results, Users can't submit jobs to the clients. I read the source
> file(import.c) and In my opinion, when each ptlrpc-connect-import or
> ptlrpc-import-recovery, the ll_imp_inval thread is triggered. So if
> the server or clients have something wrong, the thread will not exit.
> Is it right?

No. ll_imp_inval is the evictor thread. It is started when a client was not connected to a server (MDS or OST) by the time recovery finished, and the server asks the client to flush its own stale data.

> we run 'ps -aux | grep ll_imp_inval' ,the results as follows:
>
> root 22568 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
> root 22569 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
> root 22570 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
> root 22571 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
> ...

Is it possible to see the output from sysrq-t (echo t > /proc/sysrq-trigger)?
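To track how many of these threads are stuck before and after collecting the sysrq-t dump, the `ps` listing can be filtered for state D (uninterruptible sleep). This is a minimal sketch that assumes the standard eleven-column `ps aux` layout shown in the quoted output; the helper name `blocked_threads` is hypothetical and not part of Lustre.

```python
def blocked_threads(ps_output, name="ll_imp_inval"):
    """Return PIDs of threads called `name` that are in state D."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Assumed `ps aux` columns:
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        if len(fields) < 11:
            continue
        stat, command = fields[7], fields[10]
        # Kernel threads are shown in brackets, e.g. [ll_imp_inval].
        if stat.startswith("D") and command.strip("[]") == name:
            pids.append(int(fields[1]))
    return pids


# The four lines quoted from the affected client.
SAMPLE = """\
root 22568 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22569 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22570 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
root 22571 0.0 0.0 0 0 ? D Aug13 0:00 [ll_imp_inval]
"""

print(blocked_threads(SAMPLE))  # prints [22568, 22569, 22570, 22571]
```

On a live node the same function could be fed the text of `ps aux`. Every task in state D adds to the load average, which would be consistent with the load of 28 reported above.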
huangql
2009-Aug-17 14:47 UTC
[Lustre-discuss] Why are there many threads named "ll_imp_inval" in client?
Hi Alexey,

The version of Lustre we are using is V1.6.6. I hope you can give me more suggestions.

Thanks,
Sarea
2009-08-17
huangql
huangql
2009-Aug-18 01:52 UTC
[Lustre-discuss] Why are there many threads named "ll_imp_inval" in client?
Hi, all

Sorry, I should have given the Lustre versions in detail. The versions on our system are as follows:

MDS: V1.6.5
OSS: V1.6.6
Clients: V1.6.6

Thank you for your help.

Best wishes,
Sarea
2009-08-18
huangql