Hello,

We upgraded our clients to 1.6.6 and the servers are still on 1.6.5.
We are still seeing the login nodes, much more than the compute
nodes, being evicted and never able to recover:

LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got rc -11 from cancel RPC: canceling anyway
LustreError: 15616:0:(ldlm_request.c:1605:ldlm_cli_cancel_list()) ldlm_cli_cancel_list: -11
Lustre: Request x960964 sent from nobackup-MDT0000-mdc-00000100f7eb0800 to NID 10.164.3.247@tcp 100s ago has timed out (limit 100s).
Lustre: Skipped 2 previous similar messages
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 23549:0:(mdc_locks.c:598:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@000001008d7e0400 x961041/t0 o35->nobackup-MDT0000_UUID@10.164.3.246@tcp:23/10 lens 296/1248 e 0 to 100 dl 0 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 19442:0:(client.c:722:ptlrpc_import_delay_req()) Skipped 691 previous similar messages
LustreError: 19442:0:(file.c:116:ll_close_inode_openhandle()) inode 93290605 mdc close failed: rc = -108
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection restored to service nobackup-MDT0000 using nid 10.164.3.246@tcp.
LustreError: 23549:0:(mdc_request.c:741:mdc_close()) Unexpected: can't find mdc_open_data, but the close succeeded. Please tell <http://bugzilla.lustre.org/>.
Lustre: nobackup-MDT0000-mdc-00000100f7eb0800: Connection to service nobackup-MDT0000 via nid 10.164.3.246@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: 3879:0:(import.c:410:import_select_connection()) nobackup-MDT0000-mdc-00000100f7eb0800: tried all connections, increasing latency to 36s
Lustre: 3879:0:(import.c:410:import_select_connection()) Skipped 2 previous similar messages
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
LustreError: 11-0: an error occurred while communicating with 10.164.3.246@tcp. The mds_connect operation failed with -16
Lustre: Changing connection for nobackup-MDT0000-mdc-00000100f7eb0800 to 10.164.3.246@tcp/10.164.3.246@tcp

Is this the same bug? The compute nodes look mostly OK, but the
above still happens every few days. I don't notice any mention of
statahead, but should I go ahead and set it to 0 again?

Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
brockp@umich.edu
(734) 936-1985
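For context, "set it to 0" above refers to the client-side statahead
tunable under llite. A minimal sketch of checking and disabling it on a
1.6.x client, assuming the 1.6-era /proc layout (the directory name
under llite/ combines the filesystem name with an instance id):

    # show the current statahead window for each mounted Lustre filesystem
    cat /proc/fs/lustre/llite/*/statahead_max

    # 0 disables statahead; apply to every mounted filesystem instance
    for f in /proc/fs/lustre/llite/*/statahead_max; do
        echo 0 > "$f"
    done

The setting does not survive a remount, so it is cheap to toggle for an
A/B test and revert afterwards.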
On Dec 07, 2008 09:09 -0500, Brock Palen wrote:
> We upgraded our clients to 1.6.6 and the servers are still on 1.6.5.
> We are still seeing the login nodes, much more than the compute
> nodes, being evicted and never able to recover:
>
> LustreError: 11-0: an error occurred while communicating with
> 10.164.3.246@tcp. The mds_connect operation failed with -16
> LustreError: 15616:0:(ldlm_request.c:996:ldlm_cli_cancel_req()) Got
> rc -11 from cancel RPC: canceling anyway
> LustreError: 167-0: This client was evicted by nobackup-MDT0000; in
> progress operations using this service will fail.

Having error messages from the servers is critical to figure out what
is going on.

> Is this the same bug? The compute nodes look mostly OK, but the
> above still happens every few days. I don't notice any mention of
> statahead, but should I go ahead and set it to 0 again?

At worst it would only slow down "ls -l" performance, and it would
tell us whether statahead is the culprit or not.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
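Following up on the point about server-side messages: the matching
errors would be in the MDS's own logs. A minimal sketch of what to
collect there, assuming syslog writes to /var/log/messages and with
/tmp/lustre-mds-debug.log as an example output path:

    # on the MDS (10.164.3.246), around the time a client is evicted:
    dmesg | grep -i lustre | tail -n 100
    grep -iE "evict|lustre" /var/log/messages | tail -n 100

    # optionally dump the in-kernel Lustre debug buffer for a bug report
    lctl dk /tmp/lustre-mds-debug.log

Comparing timestamps on the server side against the client console
output above would show whether the mds_connect failures with -16
(EBUSY) coincide with MDS recovery windows.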