Ms. Megan Larko
2008-Sep-02 21:05 UTC
[Lustre-discuss] Re-activating a partial lustre disk
Hello,

Getting back to some hardware which experienced a failure, I followed instructions per Andreas Dilger and mounted the data storage targets (crew4-OST0001, crew4-OST0003 and crew4-OST0004 -- the physical hardware for targets crew4-OST0000 and crew4-OST0002 has failed). Then I went to the MGS and mounted crew4-MDT0000. Next I used lctl to dl (for device list, I'm assuming) and explicitly deactivated the ID numbers associated with crew4-OST0000 and crew4-OST0002. There were no errors on either the OSS computer hosting the OSTs or on the MGS hosting the MDT.

I attempted to mount the /crew4 lustre disk read-only on a client. The activity timed out. On the MGS, the recovery status is indicated as follows:

> cat /proc/fs/lustre/mds/crew4-MDT0000/recovery_status
status: RECOVERING
recovery_start: 1220380113
time remaining: 0
connected_clients: 0/2
completed_clients: 0/2
replayed_requests: 0/??
queued_requests: 0
next_transno: 112339940

There are zero seconds left but the status is still "RECOVERING". A tail of the MGS /var/log/messages indicates:

Sep 2 14:29:07 mds1 kernel: Lustre: setting import crew4-OST0000_UUID INACTIVE by administrator request
Sep 2 14:29:37 mds1 kernel: Lustre: setting import crew4-OST0002_UUID INACTIVE by administrator request
Sep 2 15:09:54 mds1 ntpd[2857]: no servers reachable
Sep 2 15:28:00 mds1 kernel: Lustre: 3373:0:(ldlm_lib.c:1114:target_start_recovery_timer()) crew4-MDT0000: starting recovery timer (2500s)
Sep 2 15:28:00 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:786:target_handle_connect()) crew4-MDT0000: denying connection for new client 172.18.0.11@o2ib (1076f71f-3b0c-025c-586f-3f2649955011): 2 clients in recovery for 2500s
Sep 2 15:28:00 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff81006533dc00 x2169503/t0 o38-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -16/0
Sep 2 15:32:10 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:786:target_handle_connect()) crew4-MDT0000: denying connection for new client 172.18.0.11@o2ib (1076f71f-3b0c-025c-586f-3f2649955011): 2 clients in recovery for 2250s
Sep 2 15:32:10 mds1 kernel: LustreError: 3373:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-16) req@ffff81002cc5b400 x2169552/t0 o38-><?>@<?>:-1 lens 240/144 ref 0 fl Interpret:/0/0 rc -16/0
Sep 2 15:44:04 mds1 ntpd[2857]: synchronized to 10.0.1.97, stratum 3
Sep 2 16:09:40 mds1 kernel: LustreError: 0:0:(ldlm_lib.c:1072:target_recovery_expired()) crew4-MDT0000: recovery timed out, aborting

So the recovery timed out after more than one hour. Will crew4-MDT0000 never recover because two of its OSTs are missing, even though they have been deactivated? If yes, is there a way to mount the crew4 lustre disk with its remaining parts for recovery?

Any and all suggestions are genuinely appreciated.

megan
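For reference, the sequence described here amounts to roughly the following on the MDS; the device indices are placeholders, to be taken from whatever lctl dl reports on your own system:

  lctl dl                                  # list devices; note the index of each crew4 OSC
  lctl --device <crew4-OST0000-osc index> deactivate
  lctl --device <crew4-OST0002-osc index> deactivate
  cat /proc/fs/lustre/mds/crew4-MDT0000/recovery_status   # watch MDT recovery progress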
On Sep 02, 2008 17:05 -0400, Ms. Megan Larko wrote:
> 3373:0:(ldlm_lib.c:1114:target_start_recovery_timer()) crew4-MDT0000:
> starting recovery timer (2500s)

What is your lustre timeout (/proc/sys/lustre/timeout)? The 2500s recovery timeout is much too large.

> Sep 2 16:09:40 mds1 kernel: LustreError:
> 0:0:(ldlm_lib.c:1072:target_recovery_expired()) crew4-MDT0000:
> recovery timed out, aborting

You should be able to mount the client after this time.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
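For reference, the timeout Andreas asks about can be checked and, if needed, lowered like this; the 100s figure is only an example (roughly the 1.6 default), pick a value suited to your network:

  cat /proc/sys/lustre/timeout          # current obd timeout in seconds
  echo 100 > /proc/sys/lustre/timeout   # lower it (example value)
  # or equivalently:
  sysctl -w lustre.timeout=100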
Brian Behlendorf
2008-Sep-04 16:22 UTC
[Lustre-discuss] Re-activating a partial lustre disk
> On Sep 02, 2008 17:05 -0400, Ms. Megan Larko wrote:
> > 3373:0:(ldlm_lib.c:1114:target_start_recovery_timer()) crew4-MDT0000:
> > starting recovery timer (2500s)
>
> What is your lustre timeout (/proc/sys/lustre/timeout)? The 2500s
> recovery timeout is much too large.

We have seen a similar problem and Sun is working on it in bug 16389.

bug 16389: replay-vbr; replay-ost-single too long recovery time

--
Thanks,
Brian
Ms. Megan Larko
2008-Sep-04 18:34 UTC
[Lustre-discuss] Re-activating a partial lustre disk
Greetings!

Following the long recovery period I am able to mount the disk. The mount command returns very nearly immediately. The difficulty is that the mounted disk cannot be used. All commands such as "ls" or "df" or "cd" will hang. Eventually I "fuser -km /crew4" and "umount -f crew4" to clear the process and free the command line. So the disk now mounts but is unusable for all practical purposes.

The log files contain the following information:

The MGS/MDS:
Sep 4 14:07:13 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.14@o2ib. The ost_connect operation failed with -19
Sep 4 14:07:13 mds1 kernel: Lustre: Client crew4-client has started
Sep 4 14:07:13 mds1 kernel: LustreError: Skipped 1 previous similar message

The OSS:
Sep 4 14:10:56 oss3 kernel: LustreError: 4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103e7ba5a00 x5186/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Sep 4 14:10:56 oss3 kernel: LustreError: 4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 7 previous similar messages
Sep 4 14:10:56 oss3 kernel: LustreError: Skipped 7 previous similar messages
Sep 4 14:15:04 oss3 kernel: LustreError: 11-0: an error occurred while communicating with 0@lo. The ost_connect operation failed with -19

The client box on which the /crew4 disk was mounted (and yes, it appears properly in mtab FWIW):

First --- its own MGS:
Sep 4 14:07:13 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.14@o2ib. The ost_connect operation failed with -19
Sep 4 14:07:13 mds1 kernel: Lustre: Client crew4-client has started
Sep 4 14:07:13 mds1 kernel: LustreError: Skipped 1 previous similar message

Second -- its own OSS:
Sep 4 14:10:56 oss3 kernel: LustreError: 4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) @@@ processing error (-19) req@ffff8103e7ba5a00 x5186/t0 o8-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0
Sep 4 14:10:56 oss3 kernel: LustreError: 4881:0:(ldlm_lib.c:1442:target_send_reply_msg()) Skipped 7 previous similar messages
Sep 4 14:10:56 oss3 kernel: LustreError: Skipped 7 previous similar messages

Third -- from a real client (which I don't like to potentially hang):
Sep 3 16:31:05 crew01 kernel: Lustre: Client crew4-client has started
Sep 3 16:31:05 crew01 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.14@o2ib. The ost_connect operation failed with -19
Sep 3 16:31:05 crew01 kernel: LustreError: Skipped 1 previous similar message
Sep 3 16:35:15 crew01 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.14@o2ib. The ost_connect operation failed with -19
Sep 3 16:35:15 crew01 kernel: LustreError: Skipped 1 previous similar message
Sep 3 16:35:27 crew01 mountd[3994]: authenticated unmount request from crewtape1.iges.org:1015 for /crew3 (/crew3)
Sep 3 16:39:25 crew01 kernel: LustreError: 11-0: an error occurred while communicating with 172.18.0.14@o2ib. The ost_connect operation failed with -19

Note that the above messages on the real client will appear until I have unmounted the /crew4 lustre disk.
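One approach not tried in this thread that may be relevant to the hanging client is the client-side "exclude" mount option (listed in the Lustre 1.6 manual for starting a client with known-inactive OSTs), so the client does not block waiting for the missing targets. A rough sketch, assuming the MGS is at 172.18.0.10@o2ib (the MGC NID shown in the device listing that follows) and the usual /crew4 mount point:

  # mount read-only, telling the client up front that two OSTs are gone
  mount -t lustre -o ro,exclude=crew4-OST0000:crew4-OST0002 \
      172.18.0.10@o2ib:/crew4 /crew4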
On the MGS/MDS:

lctl > dl
  0 UP mgs MGS MGS 13
  1 UP mgc MGC172.18.0.10@o2ib 81039216-0261-c74d-3f2f-a504788ad8f8 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 9
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew3-mdtlov crew3-mdtlov_UUID 4
  9 UP mds crew3-MDT0000 crew3mds_UUID 9
 10 UP osc crew3-OST0000-osc crew3-mdtlov_UUID 5
 11 UP osc crew3-OST0001-osc crew3-mdtlov_UUID 5
 12 UP osc crew3-OST0002-osc crew3-mdtlov_UUID 5
 13 UP lov crew4-mdtlov crew4-mdtlov_UUID 4
 14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
 15 UP osc crew4-OST0001-osc crew4-mdtlov_UUID 5
 16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
 17 UP mds crew4-MDT0000 crew4-MDT0000_UUID 9
 18 UP osc crew4-OST0003-osc crew4-mdtlov_UUID 5
 19 UP osc crew4-OST0004-osc crew4-mdtlov_UUID 5

The other lustre disks /crew2 and /crew3 are working just fine; no errors. The /crew4 disk on the MGS/MDS shows crew4-OST0000 and crew4-OST0002 as "UP", even though they have been specifically deactivated.

On the OSS hosting the /crew4 disks, lctl shows the following:

  0 UP mgc MGC172.18.0.10@o2ib b4c1b639-11d5-9092-c0d0-cebc2365afec 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter crew4-OST0001 crew4-OST0001_UUID 11
  3 UP obdfilter crew4-OST0003 crew4-OST0003_UUID 11
  4 UP obdfilter crew4-OST0004 crew4-OST0004_UUID 11

Most of the errors are ost_connect failures. Is the MGS/MDT disk crew4-MDT0000 still trying to use all of the OSTs? Do I need to dummy some hardware into 8Tb partitions formatted for lustre and named crew4-OST0000 and crew4-OST0002 on the OSS to "trick" lustre into connecting with an OST which is deactivated?

I would like to be able to get a few files from this damaged disk if possible. However, if that is not to be, I will learn and move on.

Enjoy your day!
megan
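A sketch of repeating the deactivation against the crew4 OSC indices shown in the MGS/MDS listing above, plus the conf_param form from the 1.6 manual that is meant to make the change permanent on the MGS (verify the exact parameter names for your Lustre version):

  # on the MDS, using the indices from lctl dl above
  lctl --device 14 deactivate    # crew4-OST0000-osc
  lctl --device 16 deactivate    # crew4-OST0002-osc

  # on the MGS, to record the OSTs as permanently inactive
  lctl conf_param crew4-OST0000.osc.active=0
  lctl conf_param crew4-OST0002.osc.active=0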