RS RS
2006-Aug-15 12:21 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
Hi, I have a question about failover and recovery.

I am able to switch over my active MDS (MDS-1) to a standby MDS (call it MDS-2). If I then want to reboot my cluster, which MDS should become active? What commands do I need to give the client to help it find the correct MDS?

What if MDS-1 is still dead? In this case, MDS-2 (after reboot) is getting the following error message:

LustreError: 10083:0:(ldlm_lib.c:544:target_handle_connect()) @@@ UUID 'ha-mds_UUID' is not available for connect (no target) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc 0/0
LustreError: 10083:0:(ldlm_lib.c:1262:target_send_reply_msg()) @@@ processing error (-19) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl Interpret:/0/0 rc -19/0

Any help would be appreciated.

-Roger
Nathaniel Rutman
2006-Aug-15 14:44 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
RS RS wrote:
> Hi, I have a question about failover and recovery.
>
> I am able to switch over my active MDS (MDS-1) to a standby MDS (call
> it MDS-2). If I then want to reboot my cluster, which MDS should
> become active?

Reboot means reboot all Lustre services and clients? If you're talking about zconf mounting, you need to specify at least the location of the active MDS; you can also specify other failover locations. It doesn't matter which MDS node is active.

> What commands do I need to give the client to help it find the correct
> MDS?

For a live client, nothing; for a restarting client, just the mount command with the list of MDS NIDs to try to contact, e.g.

  mount -t lustre MDS-1@tcp:MDS-2@tcp:/ha-mds/client /mnt/lustre

> What if MDS-1 is still dead? In this case, MDS-2 (after reboot) is
> getting the following error message:
>
> LustreError: 10083:0:(ldlm_lib.c:544:target_handle_connect()) @@@ UUID
> 'ha-mds_UUID' is not available for connect (no target)
> req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens 240/0 ref 0 fl
> Interpret:/0/0 rc 0/0
> LustreError: 10083:0:(ldlm_lib.c:1262:target_send_reply_msg()) @@@
> processing error (-19) req@000001007c037e00 x9/t0 o38-><?>@<?>:-1 lens
> 240/0 ref 0 fl Interpret:/0/0 rc -19/0

That looks like LNET is running but not the MDS named "ha-mds". cat /proc/fs/lustre/devices to see what's actually running.
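[As a rough sketch of the check Nathan suggests, run on the node that should currently be serving the MDS; "ha-mds" is the target name from this thread, the rest is generic:]

  # List the obd devices LNET/Lustre has actually set up on this node;
  # the ha-mds target should appear here if the MDS is really running
  cat /proc/fs/lustre/devices

  # If ha-mds is absent, LNET alone is up and the MDS service still
  # needs to be started (or failed over) on this node
  grep ha-mds /proc/fs/lustre/devices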
Nathaniel Rutman
2006-Aug-17 12:55 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
RS RS wrote:
>>> What commands do I need to give the client to help it find the
>>> correct MDS?
>
> Nathan wrote:
>
>> For a live client, nothing; for a restarting client, just the mount
>> command with the list of MDS NIDs to try to contact, e.g. mount -t
>> lustre MDS-1@tcp:MDS-2@tcp:/ha-mds/client /mnt/lustre
>
> Nathan, I typed the following mount command with the following
> results:
>
> blade-lustre0:~# mount -t lustre roger-ha-1@tcp:roger-ha-2@tcp:/ha-mds/client /mnt/lustre
> mount.lustre: mount(roger-ha-1@tcp:roger-ha-2@tcp:/ha-mds/client, /mnt/lustre) failed: No such device
> mds nid 0: roger-ha-1.terascala.com@tcp
> mds name: roger-ha-2@tcp:
> profile: ha-mds/client
> options: rw

Oops, it's a ',' not a ':':

  ./mount.lustre --help
  mount.lustre v1.3
  usage: mount.lustre <mdsnode>[,<altmdsnode>]:/<mdsname>/<cfgname> <mountpt> [-fhnv] [-o mntopt]

> I'm running 1.4.6.4. Do you have any other suggestions as to which
> mount options I should use?

And sorry I didn't read to the end of your message :(

> If you have a moment, could you please also answer a couple of other
> small questions:
>
> On the OSC, there is a file:
> /proc/fs/lustre/mdc/MDC_blade-lustre0_ha-mds_MNT_client/mds_conn_uuid
> that lists the MDS UUID.
>
> Is there an equivalent file on the OSC that lists the OSS?

A Lustre "client" has both a MetaDataClient (mdc) and ObjectStorageClient(s) (oscs):

  ./mdc/lustre-MDT0000-mdc-a841b400/mds_conn_uuid
  ./osc/lustre-OST0000-osc-a841b400/ost_conn_uuid

An MDS also has osc(s), since it also acts as a client to the OSTs:

  ./osc/lustre-OST0000-osc/ost_conn_uuid

> Is there an equivalent file on the OSS that lists the MDS?

An OST is not a client of anything, so it has no oscs, and no, we don't have a list of its exports.
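[A minimal sketch of the corrected mount and of reading the connection-UUID files Nathan lists; the node names roger-ha-1/roger-ha-2 come from this thread, but the exact directory names under mdc/ and osc/ depend on the configuration, so they are globbed here rather than spelled out:]

  # Failover-aware client mount: the two MDS NIDs are separated by a comma,
  # followed by :/<mdsname>/<cfgname>
  mount -t lustre roger-ha-1@tcp,roger-ha-2@tcp:/ha-mds/client /mnt/lustre

  # On the client, ask the MDC which MDS it is currently connected to
  cat /proc/fs/lustre/mdc/*/mds_conn_uuid

  # Likewise, ask each OSC which OST it is currently connected to
  cat /proc/fs/lustre/osc/*/ost_conn_uuid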
RS RS
2006-Aug-18 07:40 UTC
[Lustre-discuss] Bringing a cluster with failed MDS down then up
Nathan, I tried it with the comma, with the same results (I also added the -v option):

mount -t lustre -v roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client /mnt/lustre
verbose: 1
arg[0] = /sbin/mount.lustre
arg[1] = -v
arg[2] = -o
arg[3] = rw
arg[4] = roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client
arg[5] = /mnt/lustre
source = roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client, target = /mnt/lustre
mds nid 0: roger-ha-1.terascala.com@tcp
mds nid 1: roger-ha-2@tcp
mds name: ha-mds
profile: client
options: rw
mount.lustre: mount(roger-ha-1.terascala.com@tcp,roger-ha-2.terascala.com@tcp:/ha-mds/client, /mnt/lustre) failed: No such device
mds nid 0: roger-ha-1.terascala.com@tcp
mds nid 1: roger-ha-2@tcp
mds name: ha-mds
profile: client
options: rw

The script that is used to build the .xml file is:

lmc -m failoverLustre.xml --add net --node roger-ha-1 --nid roger-ha-1 --nettype tcp
lmc -m failoverLustre.xml --add net --node roger-ha-2 --nid roger-ha-2 --nettype tcp
lmc -m failoverLustre.xml --add net --node blade-lustre2 --nid blade-lustre2 --nettype tcp
lmc -m failoverLustre.xml --add net --node client --nid * --nettype tcp
lmc -m failoverLustre.xml --add mds --node roger-ha-1 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
lmc -m failoverLustre.xml --add mds --node roger-ha-2 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
lmc -m failoverLustre.xml --add lov --lov lov-ts --mds ha-mds --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
lmc -m failoverLustre.xml --add ost --node blade-lustre2 --lov lov-ts --ost ost1-ts --fstype ext3 --dev /dev/sda1
lmc -m failoverLustre.xml --add mtpt --node client --path /mnt/lustre --mds ha-mds --lov lov-ts

I'm able to run fine using the lconf command:

lconf -v --node client /home/roger/lustreConfigs/failoverLustre/failoverLustre.xml

-Roger
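[A hedged troubleshooting sketch, not from the thread itself: with 1.4-era zconf mounts, one common cause of "No such device" from mount.lustre is that the Lustre client modules are not loaded on the client node, so a quick first check there might be:]

  # Is the lustre filesystem type registered with the kernel?
  # If not, try loading the client modules.
  grep lustre /proc/filesystems || modprobe lustre

  # After the modules are loaded, see which Lustre obd devices exist
  # on this node before retrying the mount
  cat /proc/fs/lustre/devices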