Ms. Megan Larko
2008-Sep-03 15:39 UTC
[Lustre-discuss] Re-activating a partial lustre disk--update
Hi,

I tried again to mount my partial Lustre disk read-only. On a client I issued the command:

  mount -t lustre -o ro ic-mds1@o2ib:/crew4 /crew4

The command-line prompt returned immediately, but the disk could not be accessed. Looking at the MGS/MDT server's /var/log/messages file, I saw that the mount again tried to re-activate the damaged (and not non-existent) crew4-OST0000 and crew4-OST0002 parts of this volume.

From messages:

  Sep 3 11:16:03 mds1 kernel: LustreError: 3361:0:(genops.c:1005:class_disconnect_stale_exports()) crew4-MDT0000: disconnecting 2 stale clients
  Sep 3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: sending delayed replies to recovered clients
  Sep 3 11:16:03 mds1 kernel: Lustre: 3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
  Sep 3 11:16:03 mds1 kernel: Lustre: 3361:0:(quota_master.c:1100:mds_quota_recovery()) Not all osts are active, abort quota recovery
  Sep 3 11:16:03 mds1 kernel: Lustre: crew4-MDT0000: recovery complete: rc 0
  Sep 3 11:16:03 mds1 kernel: Lustre: MDS crew4-MDT0000: crew4-OST0001_UUID now active, resetting orphans
  Sep 3 11:16:03 mds1 kernel: LustreError: 20824:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0000_UUID failed at update_mds: -108
  Sep 3 11:16:03 mds1 kernel: LustreError: 20824:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0000_UUID sync failed -108, deactivating
  Sep 3 11:16:03 mds1 kernel: LustreError: 20826:0:(mds_lov.c:705:__mds_lov_synchronize()) crew4-OST0002_UUID failed at update_mds: -108
  Sep 3 11:16:03 mds1 kernel: LustreError: 20826:0:(mds_lov.c:748:__mds_lov_synchronize()) crew4-OST0002_UUID sync failed -108, deactivating

I again used lctl, as in my previous post, to deactivate the device IDs associated with the failed hardware, crew4-OST0000 and crew4-OST0002.

From messages:

  Sep 3 11:20:45 mds1 kernel: Lustre: setting import crew4-OST0000_UUID INACTIVE by administrator request
  Sep 3 11:21:04 mds1 kernel: Lustre: setting import crew4-OST0002_UUID INACTIVE by administrator request

So my current understanding is that the "recovery" status before the mount was unchanged from when I left the office yesterday...

  [root@mds1 crew4-MDT0000]# cat recovery_status
  status: RECOVERING
  recovery_start: 1220380113
  time remaining: 0
  connected_clients: 0/2
  completed_clients: 0/2
  replayed_requests: 0/??
  queued_requests: 0
  next_transno: 112339940

...and that the client is not able to use a partial Lustre disk.

From messages on the client:

  Sep 3 11:22:58 crew01 kernel: Lustre: setting import crew4-MDT0000_UUID INACTIVE by administrator request
  Sep 3 11:22:58 crew01 kernel: Lustre: setting import crew4-OST0000_UUID INACTIVE by administrator request
  Sep 3 11:22:58 crew01 kernel: LustreError: 8832:0:(llite_lib.c:1520:ll_statfs_internal()) obd_statfs fails: rc -5

...and the mount hangs, I guess waiting for the bad OSTs to return. However, on the MGS/MDT the bad disks are automatically re-activated??? I ran lctl dl ten minutes after the above transactions:

   13 UP lov crew4-mdtlov crew4-mdtlov_UUID 4
   14 UP osc crew4-OST0000-osc crew4-mdtlov_UUID 5
   15 UP osc crew4-OST0001-osc crew4-mdtlov_UUID 5
   16 UP osc crew4-OST0002-osc crew4-mdtlov_UUID 5
   17 UP mds crew4-MDT0000 crew4-MDT0000_UUID 5
   18 UP osc crew4-OST0003-osc crew4-mdtlov_UUID 5
   19 UP osc crew4-OST0004-osc crew4-mdtlov_UUID 5

crew4-OST0000 and crew4-OST0002 are again listed as UP. Why? Can I echo a zero or a one to a /proc/fs/lustre file somewhere to keep these volumes from being re-activated?
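For completeness, here is roughly what I mean by "used lctl", run on the MGS/MDT node. The device numbers 14 and 16 are taken from the lctl dl listing above, and the /proc path at the end is only my guess at the kind of file I am asking about:

  # List configured devices and note the MDS-side OSC device numbers
  # for the failed OSTs (14 = crew4-OST0000-osc, 16 = crew4-OST0002-osc above).
  lctl dl

  # Deactivate those OSCs by device number.
  lctl --device 14 deactivate
  lctl --device 16 deactivate

  # Is something along these lines the way to keep them deactivated
  # across a remount? (The exact /proc path here is a guess on my part.)
  echo 0 > /proc/fs/lustre/osc/crew4-OST0000-osc/active
  echo 0 > /proc/fs/lustre/osc/crew4-OST0002-osc/active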
This is CentOS 4, Linux kernel 2.6.18-53.1.13.el5, with lustre-1.6.4.3smp.

Have a nice day!
megan