Mike
2011-Dec-08 17:35 UTC
[Lustre-community] Problems with spontaneous evictions of clients on Lustre 2.1
Hi everybody,

We have been struggling with a weird problem for some time: since version 1.8.4, then after migrating to 1.8.5, and now after migrating to 2.1, we are still experiencing sporadic client evictions by the central server. We do not know why this happens. We have 15 clients connected to the Lustre server, and several times a day we see something like this in the server logs:

------
Lustre: 6452:0:(filter_io.c:869:filter_preprw_write()) storage-OST0000: slow start_page_write 31s
Lustre: 6434:0:(filter.c:3455:filter_setattr()) storage-OST0000: slow i_alloc_sem 31s
Lustre: 6456:0:(filter_io.c:869:filter_preprw_write()) storage-OST0000: slow start_page_write 36s
Lustre: 6437:0:(filter.c:3455:filter_setattr()) storage-OST0000: slow i_alloc_sem 36s
Lustre: 6454:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001: slow start_page_write 45s
Lustre: 6465:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001: slow start_page_write 47s
Lustre: 6470:0:(filter_io.c:869:filter_preprw_write()) storage-OST0001: slow start_page_write 31s
Lustre: 6433:0:(filter.c:3455:filter_setattr()) storage-OST0001: slow i_alloc_sem 31s
Lustre: 5243:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from 16eb8188-0ece-5499-1acb-f8626d6a80dd@192.168.0.39@tcp t0 exp 0000000000000000 cur 1323314332 last 0
Lustre: 5243:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x20000c0a80027_UUID netid 20000: select flavor null
Lustre: 5082:0:(ldlm_lib.c:877:target_handle_connect()) MGS: connection from f40ee2b0-3fa4-4b90-c7e0-8b2946e6bef7@192.168.0.37@tcp t0 exp 0000000000000000 cur 1323316215 last 0
Lustre: 5082:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 3 previous similar messages
Lustre: 5082:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import MGS->NET_0x20000c0a80025_UUID netid 20000: select flavor null
Lustre: 5082:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 3 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.0.22@tcp ns: mdt-ffff810636526000 lock: ffff81053233f900/0x246ade46d6c85c1c lrc: 3/0,0 mode: PR/PR res: 8589938338/130238 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x6578889ffc95160a expref: 12728 pid: 24880 timeout: 5707546384
LustreError: 24882:0:(client.c:1045:ptlrpc_import_delay_req()) @@@ IMP_CLOSED req@ffff810190b43c00 x1386133921685632/t0(0) o-1->NET_0x20000c0a80016_UUID:15/16 lens 344/192 e 0 to 0 dl 0 ref 1 fl Rpc:N/ffffffff/ffffffff rc 0/-1
LustreError: 24882:0:(ldlm_lockd.c:638:ldlm_handle_ast_error()) ### client (nid 192.168.0.22@tcp) returned 0 from completion AST ns: mdt-ffff810636526000 lock: ffff810587e6b6c0/0x246ade46d6c85ca1 lrc: 1/0,0 mode: --/PR res: 12312741/3540065265 bits 0x2 rrc: 24 type: IBT flags: 0x200000a0 remote: 0x6578889ffc951618 expref: 6939 pid: 24880 timeout: 5707647019
Lustre: 29331:0:(ldlm_lib.c:877:target_handle_connect()) storage-MDT0000: connection from 10db495f-d837-a550-0647-d02c826f684c@192.168.0.22@tcp t322820289329 exp 0000000000000000 cur 1323332617 last 0
Lustre: 29331:0:(ldlm_lib.c:877:target_handle_connect()) Skipped 3 previous similar messages
Lustre: 29331:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import storage-MDT0000->NET_0x20000c0a80016_UUID netid 20000: select flavor null
Lustre: 29331:0:(sec.c:1474:sptlrpc_import_sec_adapt()) Skipped 3 previous similar messages
LustreError: 5650:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff8104d429d000 x1387597624433738/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323355499 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff810627f44800 x1387597624456482/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323355657 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 5 previous similar messages
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff810050b8ec00 x1387597624467844/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323355868 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 1 previous similar message
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff810111ff5800 x1387597624486695/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323356108 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21858:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 5 previous similar messages
LustreError: 21870:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff8103262e1850 x1387597624493722/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323356204 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21870:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 2 previous similar messages
LustreError: 21849:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-95) req@ffff810325c04050 x1387597624503062/t0(0) o-1->229d810c-2ff7-17dc-9e24-4fe1eeede983@NET_0x20000c0a80025_UUID:0/0 lens 448/328 e 0 to 0 dl 1323356388 ref 1 fl Interpret:/ffffffff/ffffffff rc -95/-1
LustreError: 21849:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 6 previous similar messages
LustreError: 0:0:(ldlm_lockd.c:356:waiting_locks_callback()) ### lock callback timer expired after 100s: evicting client at 192.168.0.26@tcp ns: mdt-ffff810636526000 lock: ffff8105a51ba240/0x246ade46f6d980a8 lrc: 3/0,0 mode: PR/PR res: 8589938377/22137 bits 0x3 rrc: 2 type: IBT flags: 0x4000020 remote: 0x221c008cb5aab1 expref: 11285 pid: 21871 timeout: 5740267399
LustreError: 21865:0:(mdt_handler.c:2785:mdt_recovery()) operation 101 on unconnected MDS from 12345-192.168.0.26@tcp
LustreError: 21865:0:(mdt_handler.c:2785:mdt_recovery()) Skipped 4 previous similar messages
LustreError: 21865:0:(ldlm_lib.c:2129:target_send_reply_msg()) @@@ processing error (-107) req@ffff8100439cac00 x1387008508758491/t0(0) o-1-><?>@<?>:0/0 lens 544/0 e 0 to 0 dl 1323365244 ref 1 fl Interpret:/ffffffff/ffffffff rc -107/-1
LustreError: 21865:0:(ldlm_lib.c:2129:target_send_reply_msg()) Skipped 4 previous similar messages
Lustre: 21872:0:(ldlm_lib.c:877:target_handle_connect()) storage-MDT0000: connection from e22abfde-d76c-9a91-6492-457ac0d42a02@192.168.0.26@tcp t322669877067 exp 0000000000000000 cur 1323365138 last 0
Lustre: 21872:0:(sec.c:1474:sptlrpc_import_sec_adapt()) import storage-MDT0000->NET_0x20000c0a8001a_UUID netid 20000: select flavor null
------------------------------------------

At the same time, on the clients we see something like this:

---
Lustre: storage-MDT0000-mdc-ffff880afb2a5c00: Connection to service storage-MDT0000 via nid 192.168.0.42@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 25826:0:(ldlm_request.c:110:ldlm_expired_completion_wait()) ### lock timed out (enqueued at 1323332321, 296s ago), entering recovery for storage-MDT0000_UUID@192.168.0.42@tcp ns: storage-MDT0000-mdc-ffff880afb2a5c00 lock: ffff880211a50240/0x6578889ffc951618 lrc: 4/1,0 mode: --/PR res: 12312741/3540065265 bits 0x2 rrc: 2 type: IBT flags: 0x0 remote: 0x246ade46d6c85ca1 expref: -99 pid: 25826 timeout: 0
LustreError: 167-0: This client was evicted by storage-MDT0000; in progress operations using this service will fail.
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88059cf56400 x1386136653334645/t0(0) o-1->storage-MDT0000_UUID@192.168.0.42@tcp:23/10 lens 360/824 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 847 previous similar messages
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) inode 144115249614728204 mdc close failed: rc = -108
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) Skipped 14 previous similar messages
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@ffff88064358c400 x1386136653334718/t0(0) o-1->storage-MDT0000_UUID@192.168.0.42@tcp:23/10 lens 360/824 e 0 to 0 dl 0 ref 2 fl Rpc:/ffffffff/ffffffff rc 0/-1
LustreError: 28938:0:(client.c:1060:ptlrpc_import_delay_req()) Skipped 72 previous similar messages
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) inode 11950252 mdc close failed: rc = -108
LustreError: 28938:0:(file.c:158:ll_close_inode_openhandle()) Skipped 72 previous similar messages
LustreError: 25826:0:(mdc_locks.c:722:mdc_enqueue()) ldlm_cli_enqueue: -5
LustreError: 25826:0:(mdc_locks.c:722:mdc_enqueue()) Skipped 460 previous similar messages
LustreError: 25826:0:(dir.c:424:ll_get_dir_page()) lock enqueue: [0xbbe0a5:0xd3011bf1:0x0] at 7331450859914670877: rc -5
LustreError: 25826:0:(dir.c:424:ll_get_dir_page()) Skipped 1 previous similar message
LustreError: 15606:0:(lmv_obd.c:1036:lmv_fid_alloc()) Can't alloc new fid, rc -19
LustreError: 26517:0:(file.c:2186:ll_inode_revalidate_fini()) failure -108 inode 9685121
LustreError: 26517:0:(file.c:2186:ll_inode_revalidate_fini()) Skipped 354 previous similar messages
LustreError: 26517:0:(dir.c:424:ll_get_dir_page()) lock enqueue: [0xbbe0a5:0xd3011bf1:0x0] at 8925214612138510438: rc -108
LustreError: 26517:0:(dir.c:649:ll_readdir()) error reading dir [0xbbe0a5:0xd3011bf1:0x0] at 8925214612138510438: rc -108
LustreError: 19962:0:(mdc_locks.c:722:mdc_enqueue()) ldlm_cli_enqueue: -108
LustreError: 19962:0:(mdc_locks.c:722:mdc_enqueue()) Skipped 254 previous similar messages
LustreError: 27542:0:(file.c:2186:ll_inode_revalidate_fini()) failure -108 inode 9685121
LustreError: 27542:0:(file.c:2186:ll_inode_revalidate_fini()) Skipped 174 previous similar messages
LustreError: 28938:0:(ldlm_resource.c:748:ldlm_resource_complain()) Namespace storage-MDT0000-mdc-ffff880afb2a5c00 resource refcount nonzero (1) after lock cleanup; forcing cleanup.
LustreError: 28938:0:(ldlm_resource.c:754:ldlm_resource_complain()) Resource: ffff8807bffa9800 (11888416/2271786101/0/0) (rc: 0)
Lustre: storage-MDT0000-mdc-ffff880afb2a5c00: Connection restored to service storage-MDT0000 using nid 192.168.0.42@tcp.
---------------------------------------------------

Things like this usually happen to anywhere from 1 to 7-8 clients simultaneously, for no apparent reason. We tried increasing the timeout with lctl conf_param storage.sys.timeout=1000, as recommended in several mailing list threads we found on Google, but it did not help at all.

Any ideas what could be causing this, and how should we proceed to solve this issue? From what it looks like, this was supposed to be fixed under the reported bug 23352 already in a late 1.8.x release, so we assumed 2.1 already carries that fix? Can someone confirm this?

Thanks in advance,
Mike.
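P.S. For completeness, this is roughly how we applied and checked the timeout change; the get_param check below is from memory, so the exact parameter name on 2.1 may differ on your setup:

# run on the MGS: persistent, filesystem-wide obd timeout for "storage"
lctl conf_param storage.sys.timeout=1000
# check the value a given node is actually using (parameter name assumed to be "timeout")
lctl get_param timeout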