Hello,

I have a Lustre system (still 1.6.3) whose MDT was too small and ran out of inodes. We removed files to bring it back from the edge, then unmounted the Lustre disk from the clients and started to replace the MDT on the MGS. I followed the instructions in Chapter 15 of the Lustre Manual under "Backup and Restore".

I had no trouble unmounting the MDT, remounting it as -t ldiskfs, and running the getfattr and tar commands. I unmounted the original disk, mounted a new, larger disk as -t ldiskfs, and proceeded to restore the data to the bigger disk via setfattr and tar expansion of the data I had taken just hours before from the original MDT. (Recall that all clients have had this disk unmounted, so no activity should have occurred on it.)

When I mount the new, larger disk as -t lustre as the MDT I see no mount errors, but the following errors appear in /var/log/messages on the MGS (no client access at this point):

Mar 20 12:39:40 mds1 kernel: Lustre: MDT crew8-MDT0000 now serving dev (f8a0e9b5-c2f1-8297-4ead-e34c9680b3cf) with recovery enabled
Mar 20 12:39:40 mds1 kernel: Lustre: Server crew8-MDT0000 on device /dev/METADATA2/LV2 has started
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.64.215@o2ib. The ost_connect operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.64.215@o2ib. The ost_connect operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: Skipped 6 previous similar messages
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error looking up logfile 0xa65662:0x9c30d2f6: rc -2
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(osc_request.c:3446:osc_llog_init()) failed LLOG_MDS_OST_ORIG_CTXT
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(osc_request.c:3457:osc_llog_init()) osc ''crew8-OST0000-osc'' tgt ''crew8-MDT0000'' cnt 1 catid ffffc200050f8000 rc=-2
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(osc_request.c:3459:osc_llog_init()) logid 0xa65662:0x9c30d2f6
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(lov_log.c:214:lov_llog_init()) error osc_llog_init idx 0 osc ''crew8-OST0000-osc'' tgt ''crew8-MDT0000'' (rc=-2)
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2
Mar 20 12:39:40 mds1 kernel: LustreError: 3460:0:(llog_obd.c:392:llog_cat_initialize()) rc: -2
Mar 20 12:40:05 mds1 kernel: LustreError: 11-0: an error occurred while communicating with 192.168.64.215@o2ib. The ost_connect operation failed with -114
Mar 20 12:40:05 mds1 kernel: LustreError: Skipped 4 previous similar messages

df shows the volume (crew8-MDT0000) mounted, as is the other disk (crew2-MDT0000):

/dev/METADATA1/LV1  204G  4.7G  190G  3%  /srv/lustre/mds/crew2-MDT0000
/dev/METADATA2/LV2  204G  5.3G  187G  3%  /srv/lustre/mds/crew8-MDT0000

lctl dl shows all of the disks as being up:

[root@mds1 ~]# lctl
lctl > dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.64.210@o2ib b09fab05-c2ad-8ebb-553e-0e35f2fba17a 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 9
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  9 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 10 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 11 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 12 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 13 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 14 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 15 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

Does this have anything to do with "Remove the recovery logs (now invalid), run ''rm OBJECTS/* CATALOGS''"? Should I just copy or rsync specific files from the smaller crew8-MDT0000 to the new, larger crew8-MDT0000?

megan
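For anyone who wants to check a device listing like the one above from a script rather than by eye, here is a minimal Python sketch. It assumes each row has the six whitespace-separated columns seen in the paste (index, state, type, name, UUID, reference count); the function names are made up for illustration and are not part of any Lustre tool. As this thread shows, a device can be listed as UP while its connection attempts still log errors, so this is only a first check.

```python
def parse_device_list(text):
    """Parse rows of an `lctl dl` style listing into dictionaries.

    Assumes six whitespace-separated columns per row, as in the paste above.
    Rows that do not fit that shape (prompts, blank lines) are skipped.
    """
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6:
            continue
        if not (fields[0].isdigit() and fields[5].isdigit()):
            continue
        devices.append({
            "index": int(fields[0]),
            "state": fields[1],
            "type": fields[2],
            "name": fields[3],
            "uuid": fields[4],
            "refcount": int(fields[5]),
        })
    return devices


def devices_not_up(devices):
    """Return the devices whose state column is anything other than UP."""
    return [d for d in devices if d["state"] != "UP"]


# A few rows copied from the listing in this message.
SAMPLE = """\
lctl > dl
  8 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  9 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 10 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
"""

if __name__ == "__main__":
    devs = parse_device_list(SAMPLE)
    print(len(devs), "devices parsed;", len(devices_not_up(devs)), "not UP")
```

Filtering the parsed list by name prefix (for example "crew8-") would give a per-filesystem view.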
On Fri, 2009-03-20 at 13:22 -0400, Ms. Megan Larko wrote:
>
> Mar 20 12:39:40 mds1 kernel: Lustre: MDT crew8-MDT0000 now serving dev
> (f8a0e9b5-c2f1-8297-4ead-e34c9680b3cf) with recovery enabled
> Mar 20 12:39:40 mds1 kernel: Lustre: Server crew8-MDT0000 on device
> /dev/METADATA2/LV2 has started
> Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
> while communicating with 192.168.64.215@o2ib. The ost_connect
> operation failed with -114

114 == Operation already in progress

Strange.

> 3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error looking up logfile
> 0xa65662:0x9c30d2f6: rc -2

Oh. A file is missing.

> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3446:osc_llog_init()) failed
> LLOG_MDS_OST_ORIG_CTXT
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3457:osc_llog_init()) osc ''crew8-OST0000-osc''
> tgt ''crew8-MDT0000'' cnt 1 catid ffffc200050f8000 rc=-2
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3459:osc_llog_init()) logid 0xa65662:0x9c30d2f6
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(lov_log.c:214:lov_llog_init()) error osc_llog_init idx 0 osc
> ''crew8-OST0000-osc'' tgt ''crew8-MDT0000'' (rc=-2)
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(llog_obd.c:392:llog_cat_initialize()) rc: -2
> Mar 20 12:40:05 mds1 kernel: LustreError: 11-0: an error occurred
> while communicating with 192.168.64.215@o2ib. The ost_connect
> operation failed with -114
> Mar 20 12:40:05 mds1 kernel: LustreError: Skipped 4 previous similar messages

Seems like maybe your backup/restore was not successful.

> Should I just copy or rsync specific files from the smaller
> crew8-MDT0000 to the new, larger crew8-MDT0000?

Well, if they are both physically mountable at the same time, an rsync is what I would have chosen over doing a backup/restore process.
Be sure to use (amongst others) the -X argument for rsync to get the EAs.

b.
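The return codes read off the log in the reply above (-114 and -2) can also be tallied mechanically. The following is a rough Python sketch, not anything shipped with Lustre: the regular expression only covers the spellings that appear in this thread ("failed with -N", "rc -N", "rc=-N", "rc: -N", "err -N"), the lookup table holds just the two Linux errno values seen here, and the function names are invented for illustration.

```python
import re
from collections import Counter

# Linux errno meanings for the two codes that appear in this thread.
KNOWN_CODES = {
    2: "ENOENT (No such file or directory)",
    114: "EALREADY (Operation already in progress)",
}

# Covers the forms seen in the pasted log:
#   "failed with -114", "rc -2", "rc=-2", "rc: -2", "err -2"
RC_RE = re.compile(r"\b(?:failed with|rc|err)\b\s*[:=]?\s*-(\d+)")


def count_return_codes(lines):
    """Count how often each (positive) error number shows up in the lines."""
    counts = Counter()
    for line in lines:
        for code in RC_RE.findall(line):
            counts[int(code)] += 1
    return counts


def describe(code):
    """Return a short description for a code, if this sketch knows it."""
    return KNOWN_CODES.get(code, "unknown to this sketch")


# Three lines taken from the log quoted above.
SAMPLE_LINES = [
    "LustreError: 11-0: an error occurred while communicating with "
    "192.168.64.215@o2ib. The ost_connect operation failed with -114",
    "LustreError: 3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error "
    "looking up logfile 0xa65662:0x9c30d2f6: rc -2",
    "LustreError: 3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2",
]

if __name__ == "__main__":
    for code, n in sorted(count_return_codes(SAMPLE_LINES).items()):
        print(f"-{code}: seen {n} time(s): {describe(code)}")
```

Feeding it the relevant part of /var/log/messages line by line would give the same kind of summary for a full log.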
It seems like you missed one step in the restore process. Try mounting the MDT partition as type "ldiskfs" and removing these files:

    rm OBJECTS/* CATALOGS

You should be able to mount the MDT without any errors at this point. If you still see some errors, do the following and try mounting it again:

    tunefs.lustre --erase-params <all the options used required> --writeconf <mdt-device>

Nirmal
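If the removal step above needs to be scripted or rehearsed before touching the real MDT, something like the Python sketch below could be used. It assumes the MDT is already mounted as ldiskfs at the directory passed in, defaults to a dry run that only reports what would be deleted, and mirrors the shell command's behaviour of skipping dot-files and directories. The function name and the demonstration file names are hypothetical; the demonstration runs against a throwaway temporary directory, not a real MDT.

```python
import tempfile
from pathlib import Path


def remove_recovery_logs(mdt_root, dry_run=True):
    """Mimic `rm OBJECTS/* CATALOGS` under an ldiskfs-mounted MDT root.

    Returns the paths that were removed (or would be, in a dry run).
    """
    root = Path(mdt_root)
    targets = []

    objects_dir = root / "OBJECTS"
    if objects_dir.is_dir():
        # Like the shell glob, skip dot-files; like plain rm, skip directories.
        targets += sorted(
            p for p in objects_dir.iterdir()
            if not p.name.startswith(".") and (p.is_symlink() or not p.is_dir())
        )

    catalogs = root / "CATALOGS"
    if catalogs.is_file():
        targets.append(catalogs)

    if not dry_run:
        for path in targets:
            path.unlink()
    return targets


if __name__ == "__main__":
    # Demonstration on a scratch directory with made-up file names.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "OBJECTS").mkdir()
        (root / "OBJECTS" / "log1").write_text("stale")
        (root / "CATALOGS").write_text("stale")
        for path in remove_recovery_logs(root, dry_run=True):
            print("would remove", path.relative_to(root))
```

Running it once with dry_run=True and reviewing the list before repeating with dry_run=False keeps the destructive step explicit.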