Hi Franco--

Franco Broi wrote:
>
> After a reset the OST/MDS came back up I can mount the filesystem on the
> OST but the clients have failed to reconnect. There are processes on the
> clients with files open on the filesystem so I can't unmount it. What do
> I need to do to kick it back into life?

Please see https://bugzilla.lustre.org/show_bug.cgi?id=2375 for a description of how to recover the MDC and OSCs on your clients. We have a script kicking around to automate that manual process, until the recovery is itself more automatic. I will track that down.

-Phil

Hi

The machine running the OST/MDS for my lustre filesystem crashed - there's nothing in the messages file to indicate why. After a reset the OST/MDS came back up I can mount the filesystem on the OST but the clients have failed to reconnect. There are processes on the clients with files open on the filesystem so I can't unmount it. What do I need to do to kick it back into life?

Jan 12 08:12:00 echo5 kernel: LustreError: 2004:(../ldlm/ldlm_lib.c:628:target_abort_recovery()) disconnecting clients and aborting recovery
Jan 12 08:12:00 echo5 kernel: Lustre: 2004:(mds_unlink_open.c:293:mds_cleanup_orphans()) removed orphan b46e:c8c50bde from MDS and OST
Jan 12 08:12:00 echo5 kernel: Lustre: 2004:(mds_unlink_open.c:293:mds_cleanup_orphans()) removed orphan b47a:c8c50bf1 from MDS and OST
Jan 12 08:12:00 echo5 kernel: Lustre: 1965:(filter.c:1538:filter_destroy_precreated()) deleting orphan objects from 429 to 2000
Jan 12 08:12:03 echo5 kernel: Lustre: 1966:(filter.c:1538:filter_destroy_precreated()) deleting orphan objects from 429 to 2000
Jan 12 08:12:05 echo5 kernel: Lustre: 1967:(filter.c:1538:filter_destroy_precreated()) deleting orphan objects from 429 to 2000
Jan 12 08:12:07 echo5 kernel: Lustre: 1968:(filter.c:1538:filter_destroy_precreated()) deleting orphan objects from 429 to 2000
Jan 12 08:12:09 echo5 kernel: Lustre: 2004:(../ldlm/ldlm_lib.c:648:target_abort_recovery()) Cleanup 2 orphans after recovery was aborted
Jan 12 08:12:09 echo5 kernel: LustreError: 2004:(recover.c:68:ptlrpc_run_recovery_over_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall RECOVERY_OVER echo5-mds_UUID: -2; check /proc/sys/lustre/upcall

On one of the clients:

LustreError: 14682:(socknal_cb.c:2491:ksocknal_find_timed_out_conn()) Timed out TX to 0xa0002805 1812 de70a800
LustreError: 14682:(socknal_cb.c:2522:ksocknal_check_peer_timeouts()) Timeout out conn->0xa0002805 ip a0002805:988
LustreError: 15433:(client.c:810:ptlrpc_expire_one_request()) @@@ timeout req@ca522400 x1315973/t0 o41->echo5-mds_UUID@NID_160.0.40.5_UUID:12 lens 64/216 ref 1 fl RPC:/0/0 rc 0
LustreError: 15433:(recover.c:100:ptlrpc_run_failed_import_upcall()) Error invoking recovery upcall /usr/lib/lustre/lustre_upcall FAILED_IMPORT echo5-mds_UUID MDC_charlie10_echo5-mds_MNT_charlie10 NID_160.0.40.5_UUID: -2; check /proc/sys/lustre/lustre_upcall
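
Both the server and the client log an "Error invoking recovery upcall ... : -2" line. As a reading aid, here is a minimal Python sketch that pulls the upcall path, event name, and return code out of such a line. It assumes the trailing number is a negated Linux errno, as is the usual kernel convention; under that assumption -2 would be ENOENT, which suggests the upcall script may simply not exist at the logged path. The function name `parse_upcall_error` and the regular expression are illustrative only and are not part of Lustre.

```python
import errno
import re

# Hypothetical pattern for the "Error invoking recovery upcall" lines above.
UPCALL_RE = re.compile(
    r"Error invoking recovery upcall (?P<path>\S+) (?P<args>.*?): (?P<rc>-?\d+);"
)


def parse_upcall_error(line):
    """Return the upcall path, event, and return code found in a log line.

    Returns None when the line is not an upcall error.
    """
    match = UPCALL_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    # Assumption: a negative rc is a negated errno value (kernel convention).
    name = errno.errorcode.get(-rc, "unknown") if rc < 0 else "unknown"
    args = match.group("args").split()
    return {
        "path": match.group("path"),
        "event": args[0] if args else "",
        "rc": rc,
        "errno_name": name,
    }


if __name__ == "__main__":
    sample = (
        "LustreError: 2004:(recover.c:68:ptlrpc_run_recovery_over_upcall()) "
        "Error invoking recovery upcall /usr/lib/lustre/lustre_upcall "
        "RECOVERY_OVER echo5-mds_UUID: -2; check /proc/sys/lustre/upcall"
    )
    print(parse_upcall_error(sample))
```

If the decoded name is ENOENT, a reasonable first check is whether the path named in the log line exists on that host.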