RS RS
2006-Aug-15 12:21 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
Hi, I have a question about failover and recovery.

I am able to switch over my active MDS (MDS-1) to a standby MDS (call it MDS-2). If I then want to reboot my cluster, which MDS should become active? What commands do I need to give the client to help it find the correct MDS?

What if MDS-1 is still dead? In this case, MDS-2 (after reboot) is getting the following error message:

LustreError: 10083:0:(ldlm_lib.c:544:target_handle_connect()) @@@ UUID 'ha-mds_UUID' is not available for connect (no target) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 10083:0:(ldlm_lib.c:1262:target_send_reply_msg()) @@@ processing error (-19) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0

Any help would be appreciated.

-Roger
Nathaniel Rutman
2006-Aug-15 14:44 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
RS RS wrote:
> Hi, I have a question about failover and recovery.
>
> I am able to switch over my active MDS (MDS-1) to a standby MDS (call
> it MDS-2). If I then want to reboot my cluster, which MDS should
> become active?

Reboot means reboot all Lustre services and clients? If you're talking about zconf mounting, you need to specify at least the location of the active MDS; you can also specify other failover locations. It doesn't matter which MDS node is active.

> What commands do I need to give the client to help it find the correct
> MDS?

For a live client, nothing; for a restarting client, just the mount command with the list of MDS NIDs to try to contact, e.g.

  mount -t lustre MDS-1@tcp:MDS-2@tcp:/ha-mds/client /mnt/lustre

> What if MDS-1 is still dead? In this case, MDS-2 (after reboot) is
> getting the following error message:
>
> LustreError: 10083:0:(ldlm_lib.c:544:target_handle_connect()) @@@ UUID
> 'ha-mds_UUID' is not available for connect (no target)
> req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl
> Interpret:/0/0 rc 0/0
> LustreError: 10083:0:(ldlm_lib.c:1262:target_send_reply_msg()) @@@
> processing error (-19) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens
> 240/0 ref 0 fl Interpret:/0/0 rc -19/0

That looks like LNET is running but not the MDS named "ha-mds". cat /proc/fs/lustre/devices to see what's actually running.
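[As a rough sketch of the check Nathan suggests, run on the node that should currently be serving the MDS; "ha-mds" is the target name from this thread, the rest is generic:]

  # List the obd devices LNET/Lustre has actually set up on this node;
  # the ha-mds target should appear here if the MDS is really running
  cat /proc/fs/lustre/devices

  # If ha-mds is absent, LNET alone is up and the MDS service still
  # needs to be started (or failed over) on this node
  grep ha-mds /proc/fs/lustre/devices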
Nathaniel Rutman
2006-Aug-17 12:55 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
RS RS wrote:
>>> What commands do I need to give the client to help it find the
>>> correct MDS?
>
> Nathan wrote:
>
>> For a live client, nothing; for a restarting client, just the mount
>> command with the list of MDS NIDs to try to contact, e.g. mount -t
>> lustre MDS-1@tcp:MDS-2@tcp:/ha-mds/client /mnt/lustre
>
> Nathan, I typed the following mount command with the following
> results:
>
> blade-lustre0:~# mount -t lustre roger-ha-1@tcp:roger-ha-2@tcp:/ha-mds/client /mnt/lustre
> mount.lustre: mount(roger-ha-1@tcp:roger-ha-2@tcp:/ha-mds/client, /mnt/lustre) failed: No such device
> mds nid 0: roger-ha-1.terascala.com@tcp
> mds name: roger-ha-2@tcp:
> profile: ha-mds/client
> options: rw

Oops, it's a ',' not a ':':

  ./mount.lustre --help
  mount.lustre v1.3
  usage: mount.lustre <mdsnode>[,<altmdsnode>]:/<mdsname>/<cfgname> <mountpt> [-fhnv] [-o mntopt]

> I'm running 1.4.6.4. Do you have any other suggestions as to which
> mount options I should use?

And sorry I didn't read to the end of your message :(

> If you have a moment, could you please also answer a couple of other
> small questions:
>
> On the OSC, there is a file:
> /proc/fs/lustre/mdc/MDC_blade-lustre0_ha-mds_MNT_client/mds_conn_uuid
> that lists the MDS UUID.
>
> Is there an equivalent file on the OSC that lists the OSS?

A Lustre "client" has both a MetaDataClient (mdc) and ObjectStorageClient(s) (oscs):

  ./mdc/lustre-MDT0000-mdc-a841b400/mds_conn_uuid
  ./osc/lustre-OST0000-osc-a841b400/ost_conn_uuid

An MDS also has osc(s), since it also acts as a client to the OSTs:

  ./osc/lustre-OST0000-osc/ost_conn_uuid

> Is there an equivalent file on the OSS that lists the MDS?

An OST is not a client of anything, so it has no oscs, and no, we don't have a list of its exports.
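[A minimal sketch of the corrected mount and of reading the connection-UUID files Nathan lists; the node names roger-ha-1/roger-ha-2 come from this thread, but the exact directory names under mdc/ and osc/ depend on the configuration, so they are globbed here rather than spelled out:]

  # Failover-aware client mount: the two MDS NIDs are separated by a comma,
  # followed by :/<mdsname>/<cfgname>
  mount -t lustre roger-ha-1@tcp,roger-ha-2@tcp:/ha-mds/client /mnt/lustre

  # On the client, ask the MDC which MDS it is currently connected to
  cat /proc/fs/lustre/mdc/*/mds_conn_uuid

  # Likewise, ask each OSC which OST it is currently connected to
  cat /proc/fs/lustre/osc/*/ost_conn_uuid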
RS RS
2006-Aug-18 07:40 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
Nathan, I tried it with the comma, with the same results (I also added the -v option):

mount -t lustre -v roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client /mnt/lustre
verbose: 1
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client
arg[5] = /mnt/lustre
source = roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client, target = /mnt/lustre
mds nid 0: roger-ha-1.terascala.com@tcp
mds nid 1: roger-ha-2@tcp
mds name: ha-mds
profile: client
options: rw
mount.lustre: mount(roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client, /mnt/lustre) failed: No such device
mds nid 0: roger-ha-1.terascala.com@tcp
mds nid 1: roger-ha-2@tcp
mds name: ha-mds
profile: client
options: rw

The script that is used to build the .xml file is:

lmc -m failoverLustre.xml --add net --node roger-ha-1 --nid roger-ha-1 --nettype tcp
lmc -m failoverLustre.xml --add net --node roger-ha-2 --nid roger-ha-2 --nettype tcp
lmc -m failoverLustre.xml --add net --node blade-lustre2 --nid blade-lustre2 --nettype tcp
lmc -m failoverLustre.xml --add net --node client --nid * --nettype tcp
lmc -m failoverLustre.xml --add mds --node roger-ha-1 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
lmc -m failoverLustre.xml --add mds --node roger-ha-2 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
lmc -m failoverLustre.xml --add lov --lov lov-ts --mds ha-mds --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m failoverLustre.xml --add ost --node blade-lustre2 --lov lov-ts --ost ost1-ts --fstype ext3 --dev /dev/sda1
lmc -m failoverLustre.xml --add mtpt --node client --path /mnt/lustre --mds ha-mds --lov lov-ts

I'm able to run fine using the lconf command:

lconf -v --node client /home/roger/lustreConfigs/failoverLustre/failoverLustre.xml

-Roger
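[A hedged troubleshooting sketch, not from the thread itself: with 1.4-era zconf mounts, one common cause of "No such device" from mount.lustre is that the Lustre client modules are not loaded on the client node, so a quick first check there might be:]

  # Is the lustre filesystem type registered with the kernel?
  # If not, try loading the client modules.
  grep lustre /proc/filesystems || modprobe lustre

  # After the modules are loaded, see which Lustre obd devices exist
  # on this node before retrying the mount
  cat /proc/fs/lustre/devices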