Per Lundqvist
2007-Dec-14 16:41 UTC
[Lustre-discuss] lock callback timer expired: evicting client
Recently we upgraded all of our lustre clients from 1.4.11.1 to 1.6.3 while remaining on 1.4.11.1 on all the MDS/OSS:es (which will be upgraded later on). This has resulted in a never previously observed error when running a specific application (ccsm3) on this lustre filesystem: Opening the same file from a handful of compute nodes results in some/all of the corresponding lustre clients getting evicted and the open system call to fail. When triggered, any processes from any node that tries to stat the directory where this file is opened will also get evicted and freeze for 100 s. (I include output from /var/log/messages last in this mail., apologies for the long mail...). I am interested in knowing whether this problem is only triggered when using 1.6.3 clients against 1.4.11.1 servers, or if it might be triggered for more similar client/server versions? Or if there is any way to prevent this from happening without upgrading servers or downgrading clients. (I will later try to reproduce this using 1.4.11.1 clients) Our setup: 1 MDS + 3 OSS: Centos4 with 2.6.9-55.0.9.EL_lustre.1.4.11.1smp clients: Centos4 with 2.6.9-55.0.9.EL_lustre.1.6.3smp (Lustre over Gigabit ethernet, no failover) We have narrowed this down to 2 similar examples which both triggers this error: one in fortran: implicit none integer i do i=1,1000 open(unit=16,file=''tmpfile'') close(16) enddo stop end and one in C: #include <fcntl.h> #include <unistd.h> #include <stdio.h> #include <stdlib.h> int main(void) { int i, fd; char cwd[4096]; char tmpfile[4196]; for (i = 0; i < 1000; i++) { if (getcwd(cwd, 4096) == NULL) exit(-1); sprintf(tmpfile,"%s/tmpfile", cwd); fd = open(tmpfile, O_RDWR|O_CREAT, 0666); close(fd); } return 0; } We can trigger the error by executing one of the examples on at least 3 nodes in parallel. i.e. 1. copy the binary to its own directory on the lustre filesystem. 2. execute the binary in parallel on at least 3 nodes. Example: gcc evict.c -o evict mkdir /lustrefs/tmpdir cp evict /lustrefs/tmpdir pdsh -f3 -w n[1-3] ''cd /lustrefs/tmpdir; ./evict'' /var/log/messages on evicted client: Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107 Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations us ing this service will wait for recovery to complete. Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail. Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5 Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 000001007dc87400 x854/t0 o101->misu2_mds_UUID at mds2_UUID:12 lens 392/728 re f 1 fl Rpc:/0/0 rc 0/0 Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp. Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107 Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations us ing this service will wait for recovery to complete. Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail. Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5 Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp. ....<snip>.. /var/log/messages on MDS: Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp ns: mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote: 0x2a6036f763199bff expref: 7 pid 5373 Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar message Dec 14 16:31:26 mds2 kernel: Lustre: 5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0 Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req at 000001012551c450 x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0 Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar messages Dec 14 16:33:06 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp ns: mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote: 0x2a6036f763199c3e expref: 6 pid 5398 Dec 14 16:33:06 mds2 kernel: Lustre: 5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0 Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req at 0000010027297800 x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0 ....<snip>... Thanks for any help/input, /Per Lundqvist -- Per Lundqvist National Supercomputer Centre Link?ping University, Sweden http://www.nsc.liu.se
Brian J. Murrell
2007-Dec-14 16:57 UTC
[Lustre-discuss] lock callback timer expired: evicting client
On Fri, 2007-12-14 at 17:41 +0100, Per Lundqvist wrote:> /var/log/messages on evicted client: > Dec 14 16:31:26 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107 > Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations us > ing this service will wait for recovery to complete. > Dec 14 16:31:26 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail. > Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5 > Dec 14 16:31:26 n75 kernel: LustreError: 5484:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req at 000001007dc87400 x854/t0 o101->misu2_mds_UUID at mds2_UUID:12 lens 392/728 re > f 1 fl Rpc:/0/0 rc 0/0 > Dec 14 16:31:26 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp. > Dec 14 16:33:06 n75 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.11.226 at tcp. The ldlm_enqueue operation failed with -107 > Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection to service misu2_mds via nid 192.168.11.226 at tcp was lost; in progress operations us > ing this service will wait for recovery to complete. > Dec 14 16:33:06 n75 kernel: LustreError: 167-0: This client was evicted by misu2_mds; in progress operations using this service will fail. > Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -5 > Dec 14 16:33:06 n75 kernel: LustreError: 5484:0:(mdc_locks.c:423:mdc_finish_enqueue()) Skipped 1 previous similar message > Dec 14 16:33:06 n75 kernel: Lustre: MDC_mds2_misu2_mds_MNT_misu2_client-000001007e011400: Connection restored to service misu2_mds using nid 192.168.11.226 at tcp. > ....<snip>.. > > /var/log/messages on MDS: > Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp ns: mds-misu2_mds_UUID lock: 0000010075f50700/0x959917d3f3b88931 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 7 type: IBT flags: 30 remote: 0x2a6036f763199bff expref: 7 pid 5373 > Dec 14 16:31:26 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) Skipped 1 previous similar message > Dec 14 16:31:26 mds2 kernel: Lustre: 5376:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0 > Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp > Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(handler.c:1478:mds_handle()) Skipped 1 previous similar message > Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req at 000001012551c450 x852/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0 > Dec 14 16:31:26 mds2 kernel: LustreError: 5394:0:(ldlm_lib.c:1343:target_send_reply_msg()) Skipped 2 previous similar messages > Dec 14 16:33:06 mds2 kernel: LustreError: 0:0:(ldlm_lockd.c:205:waiting_locks_callback()) ### lock callback timer expired: evicting client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542 at NET_0x20000c0a80b4b_UUID nid 192.168.11.75 at tcp ns: mds-misu2_mds_UUID lock: 00000100b829fb00/0x959917d3f3b89ff1 lrc: 1/0,0 mode: CR/CR res: 4780751/2726154257 bits 0x3 rrc: 8 type: IBT flags: 30 remote: 0x2a6036f763199c3e expref: 6 pid 5398 > Dec 14 16:33:06 mds2 kernel: Lustre: 5373:0:(mds_reint.c:125:mds_finish_transno()) commit transaction for disconnected client e689a9c4-2a46-9239-7fd7-5a7e2c8c6542: rc 0 > Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(handler.c:1478:mds_handle()) operation 101 on unconnected MDS from 12345-192.168.11.75 at tcp > Dec 14 16:33:06 mds2 kernel: LustreError: 5372:0:(ldlm_lib.c:1343:target_send_reply_msg()) @@@ processing error (-107) req at 0000010027297800 x927/t0 o101-><?>@<?>:-1 lens 392/0 ref 0 fl Interpret:/0/0 rc -107/0 > ....<snip>...This looks like 13917, attachment 13540. b.
Per Lundqvist
2007-Dec-17 08:23 UTC
[Lustre-discuss] lock callback timer expired: evicting client
On Fri, 14 Dec 2007, Brian J. Murrell wrote:> > This looks like 13917, attachment 13540.ok, thanks, I will test if I can reproduce it with 1.6.4.1 when it is available, /Per -- Per Lundqvist National Supercomputer Centre Link?ping University, Sweden http://www.nsc.liu.se
Per Lundqvist
2007-Dec-21 09:06 UTC
[Lustre-discuss] lock callback timer expired: evicting client
I have now tried to reproduce this problem using the 1.6.4.1 client, and it seems that the problem is fixed in 1.6.4.1 as the referenced bugzilla entry claims. Thanks Brian! But now my test case often provokes these kind of messages: /var/log/messages on MDS: Dec 20 19:05:12 mds2 kernel: LustreError: 5585:0:(ldlm_lockd.c:1056:ldlm_handle_cancel()) received cancel for unknown lock cookie 0x959917d43d0e9ea1 from client 636a48bd-4096-6592-497e-9432c3099b2f id 12345-192.168.11.73 at tcp (no errors logged on the corresponding clients and on any OSS): When running my test case 1000 times on 6 nodes, I got 269 messages about "received cancel for unknown lock...", (naming all 6 involved clients). /Per Lundqvist -- Per Lundqvist National Supercomputer Centre Link?ping University, Sweden http://www.nsc.liu.se