Simon Latapie
2009-Mar-30 11:06 UTC
[Lustre-discuss] LustreError: lock callback timer expired after
Greetings,

I currently have a Lustre system with 1 MDS, 2 OSSes with 2 OSTs each, and 37 Lustre clients (1 login node and 36 compute nodes), all using InfiniBand as the Lustre network (o2ib). All nodes run the 1.6.5.1 patched kernel.

For the past two months, several times a month, the login node gets permanently evicted from the OSTs. The OSTs show a "lock callback timer expired after ..." error, then the login node tries to reconnect and fails. As the Lustre mount holds the cluster's home directories, users can't access them and can't log in anymore. The only way I have found to recover is to reboot the login node (umount -f gets stuck). After the reboot, the login node simply reconnects to the OSTs, and everything is fine until the next "lock callback timer" issue.

Compute nodes don't seem to be affected by this problem; only the login node is.

There is no memory problem (no swap usage, no memory leaks) on either the OSTs or the login node. There is network error (no packet loss detected), in IB or IPoIB. The expiration time is quite random: from about 300s to 9000s.

I tried to reproduce the bug, for example by stressing the login node with many simultaneous file creations/reads/writes, with big I/Os (using IOR from lustre-iokit), or by stressing the compute nodes and the login node at the same time, but none of these triggered it. I also tried increasing the global timeout on the OSTs and the login node to 300, but the problem still reappeared.

Any tip or suggestion would be much appreciated. By the way, is there any way to extract the mapping between lock ids and files for a single node? (Sketches of the timeout change and of the lock question follow after the logs below.)

Here is an example of logs for the problem. Note that there is no error in /var/log/messages or the Lustre debug log at the supposed time of the lock callback, on any machine.

/var/log/messages on an OSS:

Mar 27 17:28:51 lustre-oss-0-1 kernel: LustreError: 0:0:(ldlm_lockd.c:234:waiting_locks_callback()) ### lock callback timer expired after 339s: evicting client at 192.168.198.203@o2ib ns: filter-lustre-OST0001_UUID lock: 00000101a5de8280/0xc8fe937f513470ae lrc: 1/0,0 mode: PW/PW res: 10811159/0 rrc: 8 type: EXT [0->18446744073709551615] (req 1118208->1122303) flags: 20 remote: 0xc9f5de2876c8bde0 expref: 3006 pid: 6678
Mar 27 17:29:01 lustre-oss-0-1 kernel: LustreError: 8203:0:(ldlm_lib.c:1536:target_send_reply_msg()) @@@ processing error (-107) req@00000101fd830c00 x193375/t0 o13-><?>@<?>:0/0 lens 128/0 e 0 to 0 dl 1238171441 ref 1 fl Interpret:/0/0 rc -107/0
Mar 27 17:29:01 lustre-oss-0-1 kernel: LustreError: 8203:0:(ldlm_lib.c:1536:target_send_reply_msg()) Skipped 630 previous similar messages

debug log on the login node:

00000100:02020000:7:1238171222.003080:0:11481:0:(client.c:788:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.198.206@o2ib. The ost_statfs operation failed with -107
00000100:00080000:7:1238171222.003091:0:11481:0:(recover.c:188:ptlrpc_request_handle_notconn()) import lustre-OST0000-osc-0000010821d74800 of lustre-OST0000_UUID@192.168.198.206@o2ib abruptly disconnected: reconnecting
00000100:02000400:7:1238171222.003097:0:11481:0:(import.c:133:ptlrpc_set_import_discon()) lustre-OST0000-osc-0000010821d74800: Connection to service lustre-OST0000 via nid 192.168.198.206@o2ib was lost; in progress operations using this service will wait for recovery to complete.
00000100:00080000:7:1238171222.003103:0:11481:0:(import.c:141:ptlrpc_set_import_discon()) 00000104362e3c00 lustre-OST0000_UUID: changing import state from FULL to DISCONN
00000100:00080000:7:1238171222.003108:0:11481:0:(import.c:465:ptlrpc_connect_import()) 00000104362e3c00 lustre-OST0000_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:7:1238171222.003111:0:11481:0:(import.c:342:import_select_connection()) lustre-OST0000-osc-0000010821d74800: connect to NID 192.168.198.206@o2ib last attempt 4294943333
00000100:00080000:7:1238171222.003118:0:11481:0:(import.c:423:import_select_connection()) lustre-OST0000-osc-0000010821d74800: import 00000104362e3c00 using connection 192.168.198.206@o2ib/192.168.198.206@o2ib
00000100:00080000:7:1238171222.003238:0:2274:0:(import.c:737:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4) req@000001041c3a0600 x193244/t0 o8->lustre-OST0000_UUID@192.168.198.206@o2ib:28/4 lens 304/456 e 0 to 100 dl 1238171321 ref 1 fl Interpret:R/0/0 rc 0/0
00000100:00080000:7:1238171222.003244:0:2274:0:(import.c:740:ptlrpc_connect_interpret()) 00000104362e3c00 lustre-OST0000_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:7:1238171222.003246:0:2274:0:(import.c:1049:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre-OST0000; in progress operations using this service will fail.
00000100:00080000:7:1238171222.003251:0:2274:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) evicted from lustre-OST0000_UUID@192.168.198.206@o2ib; invalidating
00000100:00080000:7:1238171222.003292:0:11486:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import lustre-OST0000-osc-0000010821d74800 to lustre-OST0000_UUID@192.168.198.206@o2ib
00000100:00080000:7:1238171222.003295:0:11486:0:(import.c:168:ptlrpc_deactivate_and_unlock_import()) setting import lustre-OST0000_UUID INVALID
00020000:01000000:7:1238171222.003300:0:11486:0:(lov_obd.c:505:lov_set_osc_active()) Marking OSC lustre-OST0000_UUID inactive
00000100:00080000:7:1238171222.181183:0:11486:0:(import.c:1025:ptlrpc_invalidate_import_thread()) 00000104362e3c00 lustre-OST0000_UUID: changing import state from EVICTED to RECOVER
00000100:00080000:7:1238171222.181189:0:11486:0:(import.c:1103:ptlrpc_import_recovery_state_machine()) reconnected to lustre-OST0000_UUID@192.168.198.206@o2ib
00000100:00080000:7:1238171222.181192:0:11486:0:(import.c:1108:ptlrpc_import_recovery_state_machine()) 00000104362e3c00 lustre-OST0000_UUID: changing import state from RECOVER to FULL
00020000:01000000:7:1238171222.181196:0:11486:0:(lov_obd.c:505:lov_set_osc_active()) Marking OSC lustre-OST0000_UUID active
00000100:02000000:7:1238171222.181202:0:11486:0:(import.c:1116:ptlrpc_import_recovery_state_machine()) lustre-OST0000-osc-0000010821d74800: Connection restored to service lustre-OST0000 using nid 192.168.198.206@o2ib.
00000100:02020000:5:1238171341.066101:0:11794:0:(client.c:788:ptlrpc_check_status()) 11-0: an error occurred while communicating with 192.168.198.204@o2ib. The ost_statfs operation failed with -107
00000100:00080000:5:1238171341.066113:0:11794:0:(recover.c:188:ptlrpc_request_handle_notconn()) import lustre-OST0001-osc-0000010821d74800 of lustre-OST0001_UUID@192.168.198.204@o2ib abruptly disconnected: reconnecting
00000100:02000400:5:1238171341.066117:0:11794:0:(import.c:133:ptlrpc_set_import_discon()) lustre-OST0001-osc-0000010821d74800: Connection to service lustre-OST0001 via nid 192.168.198.204@o2ib was lost; in progress operations using this service will wait for recovery to complete.
00000100:00080000:5:1238171341.066126:0:11794:0:(import.c:141:ptlrpc_set_import_discon()) 0000010821d14c00 lustre-OST0001_UUID: changing import state from FULL to DISCONN
00000100:00080000:5:1238171341.066130:0:11794:0:(import.c:465:ptlrpc_connect_import()) 0000010821d14c00 lustre-OST0001_UUID: changing import state from DISCONN to CONNECTING
00000100:00080000:5:1238171341.066133:0:11794:0:(import.c:342:import_select_connection()) lustre-OST0001-osc-0000010821d74800: connect to NID 192.168.198.204@o2ib last attempt 4294943333
00000100:00080000:5:1238171341.066140:0:11794:0:(import.c:423:import_select_connection()) lustre-OST0001-osc-0000010821d74800: import 0000010821d14c00 using connection 192.168.198.204@o2ib/192.168.198.204@o2ib
00000100:00080000:7:1238171341.093184:0:2274:0:(import.c:737:ptlrpc_connect_interpret()) @@@ evicting (not initial connect and flags reconnect/recovering not set: 4) req@000001082130d400 x193378/t0 o8->lustre-OST0001_UUID@192.168.198.204@o2ib:28/4 lens 304/456 e 0 to 100 dl 1238171441 ref 1 fl Interpret:R/0/0 rc 0/0
00000100:00080000:7:1238171341.093193:0:2274:0:(import.c:740:ptlrpc_connect_interpret()) 0000010821d14c00 lustre-OST0001_UUID: changing import state from CONNECTING to EVICTED
00000100:02020000:7:1238171341.093196:0:2274:0:(import.c:1049:ptlrpc_import_recovery_state_machine()) 167-0: This client was evicted by lustre-OST0001; in progress operations using this service will fail.
00000100:00080000:7:1238171341.093203:0:2274:0:(import.c:1053:ptlrpc_import_recovery_state_machine()) evicted from lustre-OST0001_UUID@192.168.198.204@o2ib; invalidating
00000100:00080000:7:1238171341.093279:0:11859:0:(import.c:1016:ptlrpc_invalidate_import_thread()) thread invalidate import lustre-OST0001-osc-0000010821d74800 to lustre-OST0001_UUID@192.168.198.204@o2ib
00000100:00080000:7:1238171341.093282:0:11859:0:(import.c:168:ptlrpc_deactivate_and_unlock_import()) setting import lustre-OST0001_UUID INVALID
00020000:01000000:7:1238171341.093291:0:11859:0:(lov_obd.c:505:lov_set_osc_active()) Marking OSC lustre-OST0001_UUID inactive
00000080:00020000:5:1238171341.093306:0:11794:0:(llite_lib.c:1561:ll_statfs_internal()) obd_statfs fails: rc = -4
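On the two points mentioned above, here is what I mean concretely. On 1.6.x the global timeout is exposed through /proc, so raising it to 300 on each OSS and on the login node looked like this (a minimal sketch, assuming the stock 1.6 /proc layout; the setting is not persistent across reboots):

    # Raise the global Lustre timeout to 300 seconds (1.6.x /proc interface)
    sysctl -w lustre.timeout=300
    # equivalently:
    echo 300 > /proc/sys/lustre/timeout

As for the lock-to-file question, the closest I can think of is a brute-force scan: the resource id in the eviction message ("res: 10811159/0" above) is the OST object id, and lfs getstripe prints the objid of every stripe object backing a file. A rough, unverified sketch, assuming the filesystem is mounted on /home and taking the object id from the logs above (it is slow on a large tree, and a match should really also be checked against the right OST index, OST0001 here):

    #!/bin/sh
    # Scan for files whose stripe uses a given OST object id.
    # OBJID is the first number of the "res:" field in the eviction message.
    OBJID=10811159
    find /home -type f 2>/dev/null | while read -r f; do
        # lfs getstripe prints one "obdidx objid objid group" row per stripe
        if lfs getstripe "$f" 2>/dev/null | grep -qw "$OBJID"; then
            echo "$f"
        fi
    done

Thanks in advance,

--
Simon Latapie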
Simon Latapie
2009-Mar-30 15:33 UTC
[Lustre-discuss] LustreError: lock callback timer expired after
Simon Latapie wrote:
> Greetings,
>
> I currently have a Lustre system with 1 MDS, 2 OSSes with 2 OSTs each,
> and 37 Lustre clients (1 login node and 36 compute nodes), all using
> InfiniBand as the Lustre network (o2ib). All nodes run the 1.6.5.1
> patched kernel.
>
> There is network error (no packet loss detected), in IB or IPoIB.

I actually meant "no" network error. Otherwise I would not have made this post.

--
Simon Latapie
Oleg Drokin
2009-Mar-30 15:44 UTC
[Lustre-discuss] LustreError: lock callback timer expired after
Hello!

On Mar 30, 2009, at 7:06 AM, Simon Latapie wrote:
> I currently have a Lustre system with 1 MDS, 2 OSSes with 2 OSTs each,
> and 37 Lustre clients (1 login node and 36 compute nodes), all using
> InfiniBand as the Lustre network (o2ib). All nodes run the 1.6.5.1
> patched kernel.
> For the past two months, several times a month, the login node gets
> permanently evicted from the OSTs. The OSTs show a "lock callback
> timer expired after ..." error, then the login node tries to reconnect
> and fails. As the Lustre mount holds the cluster's home directories,
> users can't access them and can't log in anymore. The only way I have
> found to recover is to reboot the login node (umount -f gets stuck).
> After the reboot, the login node simply reconnects to the OSTs, and
> everything is fine until the next "lock callback timer" issue.
> Compute nodes don't seem to be affected by this problem; only the
> login node is.
> There is no memory problem (no swap usage, no memory leaks) on either
> the OSTs or the login node.
> There is network error (no packet loss detected), in IB or IPoIB.
> The expiration time is quite random: from about 300s to 9000s.

There are several possible bugs that could lead to this. One that comes to mind is bug 15716. The recommended approach, of course, is to upgrade to the latest Lustre release.
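If it helps, you can check which release a node is actually running before and after an upgrade; on 1.6.x systems both of these should work (a minimal sketch; /proc path as in 1.6):

    # Print the Lustre version a node is running
    cat /proc/fs/lustre/version
    lctl lustre_build_version

Bye,
Oleg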