Hi,

After a failover, one OST has gone into INACTIVE status on the MDS:

[root@sklusp01a ~]# lctl dl
  0 UP mgs MGS MGS 145
  1 UP mgc MGC10.214.127.54@tcp 6cc1bf8e-85f9-3d93-5d32-be3d441076a7 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov l1-mdtlov l1-mdtlov_UUID 4
  4 UP mds l1-MDT0000 l1-MDT0000_UUID 131
  5 UP osc l1-OST0000-osc l1-mdtlov_UUID 5
  6 UP osc l1-OST0001-osc l1-mdtlov_UUID 5
  7 IN osc l1-OST0002-osc l1-mdtlov_UUID 5
  8 UP osc l1-OST0003-osc l1-mdtlov_UUID 5
  9 UP osc l1-OST0004-osc l1-mdtlov_UUID 5
 10 UP osc l1-OST0005-osc l1-mdtlov_UUID 5
[root@sklusp01a ~]#
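Device 7 is the MDS-side OSC import for l1-OST0002, so it is that import which is inactive, not the OSS node itself. If it is of any use, the same flag should also be readable directly with something like the following (assuming the 1.8-style parameter name applies here; 0 would mean inactive):

[root@sklusp01a ~]# lctl get_param osc.l1-OST0002-osc.active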
The interesting part of the MDS log:

Dec 27 22:58:45 sklusp01a kernel: Lustre: 1512:0:(mds_unlink_open.c:287:mds_cleanup_pending()) l1-MDT0000: orphan 44b58f0:ec7d7d52 re-opened during recovery
Dec 27 22:58:45 sklusp01a kernel: Lustre: 1512:0:(quota_master.c:1722:mds_quota_recovery()) Only 2/6 OSTs are active, abort quota recovery
Dec 27 22:58:45 sklusp01a kernel: Lustre: l1-MDT0000: Recovery period over after 10:00, of 64 clients 63 recovered and 1 was evicted.
Dec 27 22:58:45 sklusp01a kernel: Lustre: l1-MDT0000: sending delayed replies to recovered clients
Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(mds_open.c:1645:mds_close()) @@@ no handle for file close ino 72927933: cookie 0xf5ee63c8f533fd1c req@ffff810c584cbc00 x1438686218045461/t0 o35->46a7866a-0713-ce2d-f66d-44fb9b42fef8@NET_0x200000ad67f58_UUID:0/0 lens 408/752 e 0 to 0 dl 1388181531 ref 1 fl Interpret:/0/0 rc 0/0
Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(mds_open.c:1645:mds_close()) Skipped 3 previous similar messages
Dec 27 22:58:45 sklusp01a kernel: LustreError: 1578:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-116) req@ffff810c584cbc00 x1438686218045461/t0 o35->46a7866a-0713-ce2d-f66d-44fb9b42fef8@NET_0x200000ad67f58_UUID:0/0 lens 408/560 e 0 to 0 dl 1388181531 ref 1 fl Interpret:/0/0 rc -116/0
Dec 27 22:58:45 sklusp01a kernel: Lustre: MDS l1-MDT0000: l1-OST0004_UUID now active, resetting orphans
Dec 27 22:58:45 sklusp01a kernel: Lustre: Skipped 1 previous similar message
Dec 27 22:59:22 sklusp01a kernel: Lustre: 25465:0:(quota_master.c:1722:mds_quota_recovery()) Only 2/6 OSTs are active, abort quota recovery
Dec 27 22:59:22 sklusp01a kernel: Lustre: 25465:0:(quota_master.c:1722:mds_quota_recovery()) Skipped 6 previous similar messages
Dec 27 22:59:22 sklusp01a kernel: Lustre: l1-OST0002-osc: Connection restored to service l1-OST0002 using nid 10.214.127.56@tcp.
Dec 27 22:59:22 sklusp01a kernel: Lustre: MDS l1-MDT0000: l1-OST0002_UUID now active, resetting orphans
Dec 27 22:59:22 sklusp01a kernel: Lustre: Skipped 3 previous similar messages
Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(lov_obd.c:1150:lov_clear_orphans()) error in orphan recovery on OST idx 2/6: rc = -16
Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(mds_lov.c:1057:__mds_lov_synchronize()) l1-OST0002_UUID failed at mds_lov_clear_orphans: -16
Dec 27 22:59:22 sklusp01a kernel: LustreError: 3138:0:(mds_lov.c:1066:__mds_lov_synchronize()) l1-OST0002_UUID sync failed -16, deactivating
Dec 27 22:59:22 sklusp01a kernel: LustreError: 3038:0:(llog_server.c:466:llog_origin_handle_cancel()) Cancel 61 of 122 llog-records failed: -2
Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(handler.c:1513:mds_handle()) operation 101 on unconnected MDS from 12345-10.214.127.216@tcp
Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(handler.c:1513:mds_handle()) Skipped 8 previous similar messages
Dec 27 23:01:39 sklusp01a kernel: LustreError: 3002:0:(ldlm_lib.c:1919:target_send_reply_msg()) @@@ processing error (-107) req@ffff810d0162e800 x1438688843764464/t0 o101-><?>@<?>:0/0 lens 512/0 e 0 to 0 dl 1388181741 ref 1 fl Interpret:/4/0 rc -107/0

The corresponding part of the OSS log:

Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: Recovery period over after 10:04, of 65 clients 64 recovered and 1 was evicted.
Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: sending delayed replies to recovered clients
Dec 27 22:59:22 sklusp03a kernel: Lustre: l1-OST0002: received MDS connection from 10.214.127.54@tcp
Dec 27 22:59:22 sklusp03a kernel: Lustre: 15495:0:(filter.c:3127:filter_destroy_precreated()) l1-OST0002: deleting orphan objects from 53849605 to 53849665, orphan objids won't be reused any more.

How can we recover from this situation? Is it safe to activate the OST again with lctl activate? Is it possible to tell why it got DEACTIVATED in the first place?

Thanks for any hint.

Best regards,
Akos
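P.S. To make the second question concrete, what I have in mind is running, on the MDS, something like

[root@sklusp01a ~]# lctl --device 7 activate

where 7 is the index of the IN osc device from the lctl dl listing above (please correct me if a different invocation is the right one). I have held off so far because of the failed orphan cleanup (rc = -16) in the MDS log.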