Hi!

We've started to poke and prod at Lustre 1.6.4.1, and it seems to mostly
work (we haven't had it OOPS on us yet like the earlier 1.6 versions
did). However, we had this weird incident where an active client (it was
copying 4GB files and running ls at the time) got evicted by the MDS and
all OSTs. After a while the logs indicate that it did recover the
connection to the MDT, but not the OSTs ...

lctl ping between the client and the MDS/OSSes still worked.

After a while I got tired of waiting for it to recover and did
umount -f, which also hung.

So, what should happen in situations like this? What can I do to debug
it? Has anyone else seen this?

130.239.78.238 is the client, 130.239.78.233 is the MGS/MDT.

Logs from the client:
----------------------------8<------------------------
Jan 10 12:40:38 LustreError: 11-0: an error occurred while communicating with 130.239.78.233@tcp. The ldlm_enqueue operation failed with -107
Jan 10 12:40:38 LustreError: Skipped 1 previous similar message
Jan 10 12:40:38 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:40:38 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Jan 10 12:40:38 LustreError: 7975:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8100c5298800 x649493/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 432/912 ref 1 fl Rpc:/0/0 rc 0/0
Jan 10 12:40:38 LustreError: 7975:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:41:40 LustreError: 7979:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff81005ee12e00 x649521/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:41:40 LustreError: 7979:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff810043ad4a00 x649532/t0 o35->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 296/784 ref 1 fl Rpc:/0/0 rc 0/0
Jan 10 12:42:18 LustreError: 7976:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421378 mdc close failed: rc = -108
Jan 10 12:42:18 LustreError: 7976:0:(file.c:98:ll_close_inode_openhandle()) inode 8421383 mdc close failed: rc = -108
Jan 10 12:42:18 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233@tcp.
Jan 10 12:42:48 LustreError: 11-0: an error occurred while communicating with 130.239.78.235@tcp. The ost_setattr operation failed with -107
Jan 10 12:42:48 Lustre: hpfs-OST0007-osc-ffff8100016d2c00: Connection to service hpfs-OST0007 via nid 130.239.78.235@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:48 LustreError: 167-0: This client was evicted by hpfs-OST0007; in progress operations using this service will fail.
Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
Jan 10 12:42:48 LustreError: 7999:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jan 10 12:42:49 LustreError: 8001:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-5
Jan 10 12:42:50 LustreError: 11-0: an error occurred while communicating with 130.239.78.239@tcp. The ost_setattr operation failed with -107
Jan 10 12:42:50 Lustre: hpfs-OST0000-osc-ffff8100016d2c00: Connection to service hpfs-OST0000 via nid 130.239.78.239@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:50 LustreError: 167-0: This client was evicted by hpfs-OST0000; in progress operations using this service will fail.
Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) obd_setattr_async fails: rc=-107
Jan 10 12:42:50 LustreError: 8002:0:(llite_lib.c:1480:ll_setattr_raw()) Skipped 1 previous similar message
Jan 10 12:42:52 LustreError: 11-0: an error occurred while communicating with 130.239.78.236@tcp. The ost_setattr operation failed with -107
Jan 10 12:42:52 Lustre: hpfs-OST0002-osc-ffff8100016d2c00: Connection to service hpfs-OST0002 via nid 130.239.78.236@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:42:52 LustreError: 167-0: This client was evicted by hpfs-OST0002; in progress operations using this service will fail.
Jan 10 12:50:41 LustreError: 11-0: an error occurred while communicating with 130.239.78.233@tcp. The ldlm_enqueue operation failed with -107
Jan 10 12:50:41 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection to service hpfs-MDT0000 via nid 130.239.78.233@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jan 10 12:50:41 LustreError: 167-0: This client was evicted by hpfs-MDT0000; in progress operations using this service will fail.
Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -5
Jan 10 12:50:41 LustreError: 8017:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
Jan 10 12:50:41 LustreError: 8017:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
Jan 10 12:50:41 LustreError: 8016:0:(file.c:2513:ll_inode_revalidate_fini()) failure -5 inode 8421377
Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8100f650e800 x649712/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:50:56 LustreError: 8019:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 1 previous similar message
Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jan 10 12:50:57 LustreError: 8019:0:(mdc_locks.c:424:mdc_finish_enqueue()) Skipped 1 previous similar message
Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff8100f650e800 x649715/t0 o101->hpfs-MDT0000_UUID@130.239.78.233@tcp:12 lens 512/912 ref 1 fl Rpc:P/0/0 rc 0/0
Jan 10 12:50:59 LustreError: 8022:0:(client.c:519:ptlrpc_import_delay_req()) Skipped 2 previous similar messages
Jan 10 12:52:21 Lustre: hpfs-MDT0000-mdc-ffff8100016d2c00: Connection restored to service hpfs-MDT0000 using nid 130.239.78.233@tcp.
Jan 10 12:57:46 Lustre: setting import hpfs-MDT0000_UUID INACTIVE by administrator request
Jan 10 12:59:26 Lustre: setting import hpfs-OST0000_UUID INACTIVE by administrator request
----------------------------8<------------------------

Logs from the MGS/MDT:
----------------------------8<------------------------
Jan 10 12:20:31 Lustre: MGS: haven't heard from client 01a8bcfc-fd98-90a9-6aeb-7c331a658b2e (at 130.239.78.238@tcp) in 233 seconds. I think it's dead, and I am evicting it.
Jan 10 12:21:11 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 271 seconds. I think it's dead, and I am evicting it.
Jan 10 12:40:38 LustreError: 27332:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238@tcp
Jan 10 12:40:38 LustreError: 27332:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff8100e20ffe00 x649491/t0 o101-><?>@<?>:-1 lens 512/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:47:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 274 seconds. I think it's dead, and I am evicting it.
Jan 10 12:50:41 LustreError: 27322:0:(handler.c:1502:mds_handle()) operation 101 on unconnected MDS from 12345-130.239.78.238@tcp
Jan 10 12:50:41 LustreError: 27322:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff8101fbdaf000 x649703/t0 o101-><?>@<?>:-1 lens 432/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:57:26 Lustre: hpfs-MDT0000: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 251 seconds. I think it's dead, and I am evicting it.
----------------------------8<------------------------

Logs from one of the OSSes (it serves two OSTs); the logs are similar on
all OSSes:
----------------------------8<------------------------
Jan 10 12:20:46 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 246 seconds. I think it's dead, and I am evicting it.
Jan 10 12:20:56 Lustre: hpfs-OST0003: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 256 seconds. I think it's dead, and I am evicting it.
Jan 10 12:42:52 LustreError: 6665:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-107) req@ffff810081af2400 x649565/t0 o2-><?>@<?>:-1 lens 336/0 ref 0 fl Interpret:/0/0 rc -107/0
Jan 10 12:47:01 Lustre: hpfs-OST0002: haven't heard from client c542d305-5995-f79d-1c8d-c9578393358a (at 130.239.78.238@tcp) in 249 seconds. I think it's dead, and I am evicting it.
----------------------------8<------------------------

/Nikke

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 The magic of Windows - turn a 486-50 into a 4MHz XT...
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
On Thu, 10 Jan 2008, Niklas Edmundsson wrote:

> However, we had this weird incident where an active client (it was
> copying 4GB files and running ls at the time) got evicted by the MDS
> and all OSTs. After a while the logs indicate that it did recover the
> connection to the MDT, but not the OSTs ...
>
> lctl ping between the client and the MDS/OSSes still worked.
>
> umount -f, which also hung.

After I rebooted the client it recovered. However, I got this rather
suspicious log message on the MDS (similar on the OSSes):

Jan 10 14:11:43 Lustre: 27208:0:(router.c:167:lnet_notify()) Ignoring prediction from 130.239.78.233@tcp of 130.239.78.238@tcp down 4829687047 seconds in the future
/Nikke

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 Jimmy Hoffa is buried here ------> X
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Hello!

On Jan 10, 2008, at 9:45 AM, Niklas Edmundsson wrote:

> After I rebooted the client it recovered. However, I got this rather
> suspicious log message on the MDS (similar on the OSSes):
>
> Jan 10 14:11:43 Lustre: 27208:0:(router.c:167:lnet_notify()) Ignoring
> prediction from 130.239.78.233@tcp of 130.239.78.238@tcp down
> 4829687047 seconds in the future

This is a harmless message that will be silenced in 1.6.5; you can see
the details in bug 14300.

As for your original message - hard to tell what caused it. We can see
that the servers decided the client was unresponsive. Could it be some
lost network packets, for example? Weren't there any other messages at
around 12:20 and before that (that's when the client was evicted)?
Because 12:40 is already 20 minutes past the eviction.

Bye,
    Oleg
On Thu, 10 Jan 2008, Oleg Drokin wrote:

>> Ignoring prediction from 130.239.78.233@tcp of 130.239.78.238@tcp
>> down 4829687047 seconds in the future
>
> This is a harmless message that will be silenced in 1.6.5; you can see
> the details in bug 14300.

OK.

> As for your original message - hard to tell what caused it. We can see
> that the servers decided the client was unresponsive. Could it be some
> lost network packets, for example? Weren't there any other messages at
> around 12:20 and before that (that's when the client was evicted)?
> Because 12:40 is already 20 minutes past the eviction.

That's the weird thing - there's nothing Lustre-related logged before
that on the client that day! The client seems oblivious to the fact
that it's been evicted, and this happened while it was doing I/O. Also,
the clocks are synced by NTP, so they are not off by much.

I could accept network errors etc. as an explanation, but then I would
have expected the client to log something, try to reconnect, and so on.
As it was, it was simply dead in the water until I rebooted it.

What mechanism does Lustre use to check whether a peer is up? Since
lctl ping worked between all the nodes, I suspect it uses something
more involved. Can I trigger the same check using lctl?

/Nikke

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Niklas Edmundsson, Admin @ {acc,hpc2n}.umu.se | nikke at hpc2n.umu.se
---------------------------------------------------------------------------
 "Wow, Veronica, he totally wants to protect and serve you." - Meg
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
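For readers following along, the lctl commands touched on in this thread
look roughly like the sketch below. lctl dl and lctl ping appear in the
thread itself; the device number is only an example taken from lctl dl
output, and whether --device <N> recover / activate would actually
un-wedge an import in this particular state is an assumption, not
something verified here.

----------------------------8<------------------------
# List the configured Lustre devices on the client; the first column is
# the device number used with --device below (the "5" here is made up).
lctl dl

# LNET-level ping of the MGS/MDT and one OSS. This only shows that LNET
# can reach the peer, not that the Lustre import on top of it is healthy.
lctl ping 130.239.78.233@tcp
lctl ping 130.239.78.235@tcp

# Assumption: try to kick a stuck import (say, the MDC at device 5) into
# reconnecting, or re-activate one that has been marked INACTIVE.
lctl --device 5 recover
lctl --device 5 activate
----------------------------8<------------------------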
Hello!

On Jan 11, 2008, at 4:00 AM, Niklas Edmundsson wrote:

>> As for your original message - hard to tell what caused it. We can see
>> that the servers decided the client was unresponsive. Could it be some
>> lost network packets, for example? Weren't there any other messages at
>> around 12:20 and before that (that's when the client was evicted)?
>> Because 12:40 is already 20 minutes past the eviction.
>
> That's the weird thing - there's nothing Lustre-related logged before
> that on the client that day! The client seems oblivious to the fact
> that it's been evicted, and this happened while it was doing I/O. Also,
> the clocks are synced by NTP, so they are not off by much.

The clocks are not off at all, because there is a message in the server
log corresponding to the first error in the client log you provided.
The fact that there were no prior messages in the client log is very
strange.

> I could accept network errors etc. as an explanation, but then I would
> have expected the client to log something, try to reconnect, and so on.
> As it was, it was simply dead in the water until I rebooted it.

That's true.

> What mechanism does Lustre use to check whether a peer is up? Since
> lctl ping worked between all the nodes, I suspect it uses something
> more involved. Can I trigger the same check using lctl?

Lustre sends periodic PING messages to any server it has had no
communication with for some time, and any filesystem activity that
generates network traffic toward the servers also works as a health
check. Since we did not see any timeouts in the client logs, it looks
like there was no traffic from the client to the servers for those 20
minutes at all, not even Lustre-generated pings, which is pretty
strange.

Too bad it is too late to ask for Lustre debug logs at this point. If
you can replicate the problem somehow, it would be interesting to get
the lctl dk output.

Bye,
    Oleg
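A minimal sketch of how one might capture that lctl dk output around a
reproduction, going by the 1.6-era /proc layout. The debug mask value,
buffer size and output file name below are assumptions on my part, not
settings recommended anywhere in this thread.

----------------------------8<------------------------
# Widen the Lustre/LNET debug mask (-1 = all subsystems) and enlarge the
# in-kernel debug buffer before reproducing the problem.
echo -1 > /proc/sys/lnet/debug
echo 256 > /proc/sys/lnet/debug_mb

# ... reproduce the eviction / hang ...

# Dump the kernel debug buffer to a file that can be attached to a bug.
lctl dk /tmp/lustre-debug.log
----------------------------8<------------------------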