thr3ads.net - Lustre discuss - [Lustre-discuss] Replacing an MDT [Mar 2009]


Ms. Megan Larko

2009-Mar-20 17:22 UTC

[Lustre-discuss] Replacing an MDT

Hello,

I have a lustre system (still 1.6.3) that has an MDT which was too
small and ran out of inodes.  We removed files to take it back from
the edge and then unmounted the Lustre disk from the clients and
started to replace the MDT on the MGS.

I followed the instructions in Chapter 15 of the Lustre Manual under
"Backup and Restore".  I had no trouble unmounting the MDT, remounting
it as -t ldiskfs, and running the getfattr and tar commands.  I then
unmounted the original disk, mounted a new, larger disk as -t ldiskfs,
and proceeded to restore the data to it via setfattr and tar expansion
of the backup I had taken hours before from the original MDT.  (Recall
that all clients had this disk unmounted, so no activity should have
occurred on it.)  When I mount the new, larger disk as -t lustre as the
MDT I see no mount errors, but the following errors appear in
/var/log/messages on the MGS (no client access at this point):

Mar 20 12:39:40 mds1 kernel: Lustre: MDT crew8-MDT0000 now serving dev
(f8a0e9b5-c2f1-8297-4ead-e34c9680b3cf) with recovery enabled
Mar 20 12:39:40 mds1 kernel: Lustre: Server crew8-MDT0000 on device
/dev/METADATA2/LV2 has started
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:39:40 mds1 kernel: LustreError: Skipped 6 previous similar messages
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error looking up logfile
0xa65662:0x9c30d2f6: rc -2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3446:osc_llog_init()) failed
LLOG_MDS_OST_ORIG_CTXT
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3457:osc_llog_init()) osc 'crew8-OST0000-osc'
tgt 'crew8-MDT0000' cnt 1 catid ffffc200050f8000 rc=-2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(osc_request.c:3459:osc_llog_init()) logid 0xa65662:0x9c30d2f6
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(lov_log.c:214:lov_llog_init()) error osc_llog_init idx 0 osc
'crew8-OST0000-osc' tgt 'crew8-MDT0000' (rc=-2)
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2
Mar 20 12:39:40 mds1 kernel: LustreError:
3460:0:(llog_obd.c:392:llog_cat_initialize()) rc: -2
Mar 20 12:40:05 mds1 kernel: LustreError: 11-0: an error occurred
while communicating with 192.168.64.215@o2ib. The ost_connect
operation failed with -114
Mar 20 12:40:05 mds1 kernel: LustreError: Skipped 4 previous similar messages

df shows the volume (crew8-MDT0000) mounted, as is the other disk
(crew2-MDT0000):
/dev/METADATA1/LV1    204G  4.7G  190G   3% /srv/lustre/mds/crew2-MDT0000
/dev/METADATA2/LV2    204G  5.3G  187G   3% /srv/lustre/mds/crew8-MDT0000

The lctl dl shows all of the disks as being up:
[root@mds1 ~]# lctl
lctl > dl
  0 UP mgs MGS MGS 5
  1 UP mgc MGC192.168.64.210@o2ib b09fab05-c2ad-8ebb-553e-0e35f2fba17a 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov crew2-mdtlov crew2-mdtlov_UUID 4
  4 UP mds crew2-MDT0000 crew2mds_UUID 9
  5 UP osc crew2-OST0000-osc crew2-mdtlov_UUID 5
  6 UP osc crew2-OST0001-osc crew2-mdtlov_UUID 5
  7 UP osc crew2-OST0002-osc crew2-mdtlov_UUID 5
  8 UP lov crew8-mdtlov crew8-mdtlov_UUID 4
  9 UP mds crew8-MDT0000 crew8-MDT0000_UUID 15
 10 UP osc crew8-OST0000-osc crew8-mdtlov_UUID 5
 11 UP osc crew8-OST0001-osc crew8-mdtlov_UUID 5
 12 UP osc crew8-OST0002-osc crew8-mdtlov_UUID 5
 13 UP osc crew8-OST0003-osc crew8-mdtlov_UUID 5
 14 UP osc crew8-OST0004-osc crew8-mdtlov_UUID 5
 15 UP osc crew8-OST0005-osc crew8-mdtlov_UUID 5
 16 UP osc crew8-OST0006-osc crew8-mdtlov_UUID 5
 17 UP osc crew8-OST0007-osc crew8-mdtlov_UUID 5
 18 UP osc crew8-OST0008-osc crew8-mdtlov_UUID 5
 19 UP osc crew8-OST0009-osc crew8-mdtlov_UUID 5
 20 UP osc crew8-OST000a-osc crew8-mdtlov_UUID 5
 21 UP osc crew8-OST000b-osc crew8-mdtlov_UUID 5

Does this have anything to do with "Remove the recovery logs (now
invalid), run 'rm OBJECTS/* CATALOGS'"?
Should I just copy or rsync specific files from the smaller
crew8-MDT0000 to the new, larger crew8-MDT0000?
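
For readers following along, the file-level backup and restore described above can be sketched roughly as below. This is a minimal sketch, not the manual's exact text: the old device path, mount point, and backup file names are hypothetical placeholders, the new device is the one named in the log, and the script only prints the commands unless DRY_RUN=0 is set. It covers only the getfattr/tar and tar/setfattr steps mentioned in the post.

```shell
#!/bin/sh
# Sketch of a file-level MDT backup and restore (prints commands by default).
OLD_DEV="${OLD_DEV:-/dev/OLD_MDT}"           # hypothetical: original MDT device
NEW_DEV="${NEW_DEV:-/dev/METADATA2/LV2}"     # new, larger device (from the log)
MNT="${MNT:-/mnt/mdt}"                       # hypothetical scratch mount point
BACKUP="${BACKUP:-/root/mdt-backup.tgz}"     # hypothetical archive path
EA_FILE="${EA_FILE:-/root/ea.bak}"           # hypothetical EA dump path

# Print each command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

backup_mdt() {
    run mount -t ldiskfs "$OLD_DEV" "$MNT" &&
    # Dump every extended attribute (hex-encoded) relative to the MDT root.
    run sh -c "cd '$MNT' && getfattr -R -d -m '.*' -e hex -P . > '$EA_FILE'" &&
    # Archive the files themselves, keeping sparse files sparse.
    run tar -czf "$BACKUP" --sparse -C "$MNT" . &&
    run umount "$MNT"
}

restore_mdt() {
    run mount -t ldiskfs "$NEW_DEV" "$MNT" &&
    run tar -xzpf "$BACKUP" -C "$MNT" &&
    # Paths in the EA dump are relative, so restore from the MDT root.
    run sh -c "cd '$MNT' && setfattr --restore='$EA_FILE'" &&
    run umount "$MNT"
}

backup_mdt && restore_mdt
```

Run as-is it only lists the commands; reviewing that list before setting DRY_RUN=0 is the point of the wrapper.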

megan

Brian J. Murrell

2009-Mar-20 17:37 UTC


On Fri, 2009-03-20 at 13:22 -0400, Ms. Megan Larko wrote:
> Mar 20 12:39:40 mds1 kernel: Lustre: MDT crew8-MDT0000 now serving dev
> (f8a0e9b5-c2f1-8297-4ead-e34c9680b3cf) with recovery enabled
> Mar 20 12:39:40 mds1 kernel: Lustre: Server crew8-MDT0000 on device
> /dev/METADATA2/LV2 has started
> Mar 20 12:39:40 mds1 kernel: LustreError: 11-0: an error occurred
> while communicating with 192.168.64.215@o2ib. The ost_connect
> operation failed with -114
114 == Operation already in progress (EALREADY)

Strange.
> 3460:0:(llog_lvfs.c:597:llog_lvfs_create()) error looking up logfile
> 0xa65662:0x9c30d2f6: rc -2
Oh.  A file is missing.
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3446:osc_llog_init()) failed
> LLOG_MDS_OST_ORIG_CTXT
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3457:osc_llog_init()) osc 'crew8-OST0000-osc'
> tgt 'crew8-MDT0000' cnt 1 catid ffffc200050f8000 rc=-2
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(osc_request.c:3459:osc_llog_init()) logid 0xa65662:0x9c30d2f6
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(lov_log.c:214:lov_llog_init()) error osc_llog_init idx 0 osc
> 'crew8-OST0000-osc' tgt 'crew8-MDT0000' (rc=-2)
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(mds_log.c:207:mds_llog_init()) lov_llog_init err -2
> Mar 20 12:39:40 mds1 kernel: LustreError:
> 3460:0:(llog_obd.c:392:llog_cat_initialize()) rc: -2
> Mar 20 12:40:05 mds1 kernel: LustreError: 11-0: an error occurred
> while communicating with 192.168.64.215@o2ib. The ost_connect
> operation failed with -114
> Mar 20 12:40:05 mds1 kernel: LustreError: Skipped 4 previous similar
> messages
Seems like maybe your backup/restore was not successful.
> Should I just copy or rsync specific files from the smaller
> crew8-MDT0000 to the new, larger crew8-MDT0000?
Well, if they are both physically mountable at the same time, an rsync
is what I would have chosen over doing a backup/restore process.  Be
sure to use (amongst others) the -X argument for rsync to get the EAs
(extended attributes).
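
A rough sketch of that rsync approach follows. Only -X comes from the advice above; the other flags (-a, -H, -S, --numeric-ids) are an assumption about what "amongst others" might sensibly include, and the old device path and both mount points are hypothetical placeholders. Note that -X needs rsync 3.0 or later, and the copy has to run as root for the trusted.* attributes to be carried over. The script prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: copy an MDT to a larger device with rsync (prints commands by default).
OLD_DEV="${OLD_DEV:-/dev/OLD_MDT}"         # hypothetical: original MDT device
NEW_DEV="${NEW_DEV:-/dev/METADATA2/LV2}"   # new, larger device (from the log)
OLD_MNT="${OLD_MNT:-/mnt/mdt-old}"         # hypothetical mount points
NEW_MNT="${NEW_MNT:-/mnt/mdt-new}"

# Print each command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

copy_mdt() {
    # Mount the source read-only so it cannot change during the copy.
    run mount -t ldiskfs -o ro "$OLD_DEV" "$OLD_MNT" &&
    run mount -t ldiskfs "$NEW_DEV" "$NEW_MNT" &&
    # -a archive, -H hard links, -X extended attributes, -S sparse files.
    # The trailing slashes copy the contents of one root into the other.
    run rsync -aHXS --numeric-ids "$OLD_MNT/" "$NEW_MNT/" &&
    run umount "$NEW_MNT" &&
    run umount "$OLD_MNT"
}

copy_mdt
```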

b.


Nirmal Seenu

2009-Mar-20 20:59 UTC


It seems like you missed one step in the restore process. Try mounting 
the MDT partition as type "ldiskfs" and removing these files:

rm OBJECTS/* CATALOGS

You should be able to mount the MDT without any errors at this point. If 
you still see some errors do the following and try mounting it again:

tunefs.lustre --erase-params <all the required options> --writeconf <mdt-device>
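
Put together, the two steps above might look something like the sketch below. The mount point is a hypothetical placeholder, and TUNEFS_OPTS is deliberately left empty because the required options are site-specific and are not given in this thread; the script refuses to build the tunefs.lustre command without them. It uses rm -f rather than plain rm so that an already-empty OBJECTS directory is not treated as an error. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the two suggested recovery steps (prints commands by default).
MDT_DEV="${MDT_DEV:-/dev/METADATA2/LV2}"   # new MDT device, as named in the log
MNT="${MNT:-/mnt/mdt}"                     # hypothetical scratch mount point
TUNEFS_OPTS="${TUNEFS_OPTS:-}"             # site-specific; not known from the thread

# Print each command unless DRY_RUN=0, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: remove the now-invalid recovery logs from the restored MDT.
clear_llogs() {
    run mount -t ldiskfs "$MDT_DEV" "$MNT" &&
    run sh -c "cd '$MNT' && rm -f OBJECTS/* CATALOGS" &&
    run umount "$MNT"
}

# Step 2 (only if errors persist): erase parameters and rewrite the config logs.
rewrite_config() {
    if [ -z "$TUNEFS_OPTS" ]; then
        echo "TUNEFS_OPTS is empty; supply your site's options first" >&2
        return 1
    fi
    # $TUNEFS_OPTS is intentionally unquoted so it splits into separate options.
    run tunefs.lustre --erase-params $TUNEFS_OPTS --writeconf "$MDT_DEV"
}

clear_llogs
```

Only step 1 runs by default; rewrite_config would be called by hand once TUNEFS_OPTS has been filled in for the site.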

Nirmal
