Hello,

We upgraded our clients to 1.6.6 and the servers are still on 1.6.5.
We are still seeing the login nodes, much more than the compute
nodes, being evicted and never able to recover:

LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
LustreError: 15616:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Lustre: Request x960964 sent from nobackup-MDT0000-mdc-00000100f7eb0800 to NID 10.164.3.247@tcp 100s ago has timed out (limit 100s).
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 23549:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001008d7e0400 x961041/t0 o35->nobackup-MDT0000_UUID@10.164.3.246@tcp:23/10 lens 296/1248 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) Skipped 691 previous similar messages
LustreError: 19442:0:(file.c:116:ll_close_inode_openhandle()) inode 93290605 mdc close failed: rc = -108
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
LustreError: 23549:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: 3879:0:(import.c:410:import_select_connection()) nobackup-MDT0000-mdc-00000100f7eb0800: tried all connections, increasing latency to 36s
Lustre: 3879:0:(import.c:410:import_select_connection()) Skipped 2 previous similar messages
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: Changing connection for nobackup-MDT0000-mdc-00000100f7eb0800 to 10.164.3.246@tcp/10.164.3.246@tcp

Is this the same bug? The compute nodes look mostly OK, but the
above still happens every few days. I don't notice any mention of
statahead, but should I go ahead and set it to 0 again?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734) 936-1985
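For context, "set it to 0" above refers to the client-side statahead
tunable under llite. A minimal sketch of checking and disabling it on a
1.6.x client, assuming the 1.6-era /proc layout (the directory name
under llite/ combines the filesystem name with an instance id):

    # show the current statahead window for each mounted Lustre filesystem
    cat /proc/fs/lustre/llite/*/statahead_max

    # 0 disables statahead; apply to every mounted filesystem instance
    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > "$f"
    done

The setting does not survive a remount, so it is cheap to toggle for an
A/B test and revert afterwards.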
On Dec 07, 2008 09:09 -0500, Brock Palen wrote:
> We upgraded our clients to 1.6.6 and the servers are still on 1.6.5.
> We are still seeing the login nodes, much more than the compute
> nodes, being evicted and never able to recover:
>
> LustreError: 11-0: an error occurred while communicating with
> 10.164.3.246@tcp. The mds_connect operation failed with -16
> LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got
> rc -11 from cancel RPC: canceling anyway
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in
> progress operations using this service will fail.

Having error messages from the servers is critical to figure out what
is going on.

> Is this the same bug? The compute nodes look mostly OK, but the
> above still happens every few days. I don't notice any mention of
> statahead, but should I go ahead and set it to 0 again?

At worst it would only slow down "ls -l" performance, and it would
tell us whether statahead is the culprit or not.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
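Following up on the point about server-side messages: the matching
errors would be in the MDS's own logs. A minimal sketch of what to
collect there, assuming syslog writes to /var/log/messages and with
/tmp/lustre-mds-debug.log as an example output path:

    # on the MDS (10.164.3.246), around the time a client is evicted:
    dmesg | grep -i lustre | tail -n 100
    grep -iE "evict|lustre" /var/log/messages | tail -n 100

    # optionally dump the in-kernel Lustre debug buffer for a bug report
    lctl dk /tmp/lustre-mds-debug.log

Comparing timestamps on the server side against the client console
output above would show whether the mds_connect failures with -16
(EBUSY) coincide with MDS recovery windows.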