Hi,

I've read everything that I can find about Lustre failover, and I'm still having trouble getting it to work for my MDS. If anyone can spare a few minutes to read this email and suggest what I'm doing wrong, I'd really appreciate it. My boss is on my case to get this working already!

I have 4 nodes:

  roger-ha-1    -- Active MDS
  roger-ha-2    -- Standby MDS
  blade-lustre2 -- OST
  blade-lustre0 -- Client

I use the following script to generate my .xml file:

  lmc -m failoverLustre.xml --add net --node roger-ha-1 --nid roger-ha-1 --nettype tcp
  lmc -m failoverLustre.xml --add net --node roger-ha-2 --nid roger-ha-2 --nettype tcp
  lmc -m failoverLustre.xml --add net --node blade-lustre2 --nid blade-lustre2 --nettype tcp
  lmc -m failoverLustre.xml --add net --node client --nid * --nettype tcp
  lmc -m failoverLustre.xml --add mds --node roger-ha-1 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
  lmc -m failoverLustre.xml --add mds --node roger-ha-2 --mds ha-mds --fstype ext3 --dev /dev/md1 --failover
  lmc -m failoverLustre.xml --add lov --lov lov-ts --mds ha-mds --stripe_sz 1048576 --stripe_cnt 0 --stripe_pattern 0
  lmc -m failoverLustre.xml --add ost --node blade-lustre2 --lov lov-ts --ost ost1-ts --fstype ext3 --dev /dev/sda1
  lmc -m failoverLustre.xml --add mtpt --node client --path /mnt/lustre --mds ha-mds --lov lov-ts

I do the following, in this order:

1. I bring up Lustre on the OST using:

  lconf -v --reformat --upcall /root/roger/upcall --timeout 30 --node blade-lustre2 failoverLustre.xml

2. I bring up Lustre on the active MDS using:

  lconf -v --reformat --timeout 30 --node roger-ha-1 failoverLustre.xml

3. I bring up Lustre on the client (blade-lustre0) using:

  lconf -v --upcall /root/roger/upcall --timeout 30 --node client failoverLustre.xml

4. I create a few files on the client.

5. I manually halt roger-ha-1, move the disks to roger-ha-2 (the standby MDS), and start Lustre there using:

  lconf -v --reformat --force --select mds=roger-ha-2 --timeout 30 --node roger-ha-2 failoverLustre.xml

I believe that the first of my problems occurs here. Far fewer modules are loaded on roger-ha-2 than were originally loaded on roger-ha-1, and I get the message:

  ha-mds_UUID not active

6. Next, I go back to the client, expecting that my upcall would have been called. I have a very simple upcall script, namely:

  echo `date` $0 "$@" >> /root/roger/upcall.log

However, as the file /root/roger/upcall.log is never created, it is clear that the upcall is not being called (my second problem).

7. So, I look at the /var/log/messages file, and I see that Lustre is trying to call my upcall. I see messages like this:

  LustreError: 3094:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1155152056, 30s ago) req@000001007e59d400 x188/t0 o400->ha-mds_UUID@roger-ha-1_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
  Lustre: A connection with 10.200.1.251 timed out; the network or that node may be down.
  LustreError: 3071:0:(socklnd_cb.c:1981:ksocknal_check_peer_timeouts()) Timeout out conn->12345-10.200.1.251@tcp ip 10.200.1.251:988
  Lustre: 3071:0:(router.c:184:lnet_notify()) Upcall: NID 10.200.1.251@tcp is dead
  Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /root/roger/upcall ROUTER_NOTIFY,10.200.1.251@tcp,down,1155152029
  Lustre: 3073:0:(recover.c:117:ptlrpc_run_failed_import_upcall()) Invoked upcall /root/roger/upcall FAILED_IMPORT ha-mds_UUID MDC_blade-lustre0_ha-mds_MNT_client roger-ha-1_UUID 303a6_MNT_client_b06099ff7d
  LustreError: 3094:0:(client.c:951:ptlrpc_expire_one_request()) @@@ timeout (sent at 1155152063, 31s ago) req@000001007fbd0c00 x190/t0 o400->ha-mds_UUID@roger-ha-1_UUID:12 lens 64/64 ref 1 fl Rpc:N/0/0 rc 0/0
  Lustre: 3073:0:(recover.c:117:ptlrpc_run_failed_import_upcall()) Invoked upcall /root/roger/upcall FAILED_IMPORT ha-mds_UUID MDC_blade-lustre0_ha-mds_MNT_client roger-ha-1_UUID 303a6_MNT_client_b06099ff7d
  Lustre: 3073:0:(recover.c:117:ptlrpc_run_failed_import_upcall()) previously skipped 9 similar messages

I don't know why it can't find my upcall. The file clearly exists and is executable, e.g.:

  blade-lustre0:~/roger# ls -l upcall
  -rwxr-xr-x 1 root root 46 2006-08-09 09:34 upcall
  blade-lustre0:~/roger# ./upcall testing
  blade-lustre0:~/roger# cat upcall.log
  Wed Aug 9 15:55:52 EDT 2006 ./upcall testing
  blade-lustre0:~/roger#

8. So, I enter the following lconf command by hand on blade-lustre0 (the client):

  lconf --node blade-lustre0 --recover --select mds=10.200.1.252 --tgt_uuid ha-mds_UUID --client_uuid MDC_blade-lustre0_ha-mds_MNT_client --conn_uuid roger-ha-1_UUID failoverLustre.xml

and I get the error message:

  No host entry found.

I don't know what causes this. I've even added entries to the /etc/hosts file to help resolve things, but that hasn't helped. Note that 10.200.1.252 is roger-ha-2. I originally said mds=roger-ha-2, but I changed it to the IP address when I started getting those error messages.

If you can help me resolve these problems, I'd really appreciate it.

Thanks,
Roger
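[A side note on the upcall, offered as a guess rather than something the logs above prove: the kernel execs the upcall file directly, so unlike an interactive shell (which falls back to /bin/sh on ENOEXEC) it needs an explicit interpreter line. It may be worth double-checking that the script starts with one; a minimal sketch of the same one-line logger with a shebang added:]

  #!/bin/sh
  # Same logger as in step 6, with an interpreter line so the kernel's
  # usermode helper can exec the file directly.
  echo `date` $0 "$@" >> /root/roger/upcall.log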
On Thu, 10 Aug 2006, RS RS wrote:

> 5. I manually halt roger-ha-1, move the disks to roger-ha-2 (the standby
> MDS), and start Lustre there using:
>
>   lconf -v --reformat --force --select mds=roger-ha-2 --timeout 30 --node roger-ha-2 failoverLustre.xml

Are you sure you want to *reformat* your MDT when moving it to the standby MDS?

> 8. So, I enter the following lconf command by hand on blade-lustre0 (the client):
>
>   lconf --node blade-lustre0 --recover --select mds=10.200.1.252 --tgt_uuid ha-mds_UUID --client_uuid MDC_blade-lustre0_ha-mds_MNT_client --conn_uuid roger-ha-1_UUID failoverLustre.xml
>
> and I get the error message:
>
>   No host entry found.

IIRC this error is caused by lconf not finding, in the XML, a node description matching the supplied node name (here 'blade-lustre0', otherwise the hostname).

-- 
Jean-Marc Saffroy - jean-marc.saffroy@ext.bull.net
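[In other words, the failover start in step 5 would keep the existing MDT instead of wiping it. A sketch of that step with --reformat dropped and the other options left as in the original command:]

  # On roger-ha-2, after the shared /dev/md1 has been moved over.
  # No --reformat here, so the metadata written while roger-ha-1 was
  # active is preserved for recovery.
  lconf -v --force --select mds=roger-ha-2 --timeout 30 --node roger-ha-2 failoverLustre.xml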
> Are you sure you want to *reformat* your MDT when moving it to the
> standby MDS?

Good point. I'll fix this.

> > No host entry found.
>
> IIRC this error is caused by lconf not finding, in the XML, a node
> description matching the supplied node name (here 'blade-lustre0',
> otherwise the hostname).

At Jean-Marc's suggestion, I changed my lmc commands from:

  lmc ... --add net --node client --nid * ...

to:

  lmc ... --add net --node blade-lustre0 --nid blade-lustre0 ...

and from:

  lmc ... --add mtpt --node client ...

to:

  lmc ... --add mtpt --node blade-lustre0 ...

(Does this mean that I'll have to list every client in that file?)

Then, on the client, I run:

  lconf --recover --node blade-lustre0 --tgt_uuid ha-mds_UUID --client_uuid MDC_blade-lustre0_ha-mds_MNT_client --conn_uuid roger-ha-1_UUID /home/roger/lustreConfigs/failoverLustre/failoverLustre.xml

I get the following error from lconf:

  Traceback (most recent call last):
    File "/usr/sbin/lconf", line 2827, in ?
      main()
    File "/usr/sbin/lconf", line 2820, in main
      doHost(lustreDB, node_list)
    File "/usr/sbin/lconf", line 2218, in doHost
      config.conn_uuid)
    File "/usr/sbin/lconf", line 2419, in doRecovery
      srv_list = find_local_servers(get_ost_net(lustreDB, new_uuid))
  NameError: global name 'find_local_servers' is not defined

As far as I can tell, this function does not exist in /usr/sbin/lconf. Should I be getting it from somewhere else?

-Roger
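[A quick way to confirm that, assuming /usr/sbin/lconf is the plain Python script the traceback points at:]

  # Look for a definition of the missing helper anywhere in the script.
  grep -n "def find_local_servers" /usr/sbin/lconf
  # Show the call site the traceback reports (around line 2419).
  sed -n '2414,2424p' /usr/sbin/lconf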
RS RS wrote:

>> Are you sure you want to *reformat* your MDT when moving it to the
>> standby MDS?
>
> Good point. I'll fix this.

The --reformats will erase your disks. Get rid of them all.

>>> No host entry found.
>>
>> IIRC this error is caused by lconf not finding, in the XML, a node
>> description matching the supplied node name (here 'blade-lustre0',
>> otherwise the hostname).
>
> At Jean-Marc's suggestion, I changed my lmc commands from:
>
>   lmc ... --add net --node client --nid * ...
>
> to:
>
>   lmc ... --add net --node blade-lustre0 --nid blade-lustre0 ...
>
> and from:
>
>   lmc ... --add mtpt --node client ...
>
> to:
>
>   lmc ... --add mtpt --node blade-lustre0 ...
>
> (Does this mean that I'll have to list every client in that file?)

No - you should use the *, and specify --node client instead of --node blade-lustre0.

> Then, on the client, I run:
>
>   lconf --recover --node blade-lustre0 --tgt_uuid ha-mds_UUID --client_uuid MDC_blade-lustre0_ha-mds_MNT_client --conn_uuid roger-ha-1_UUID /home/roger/lustreConfigs/failoverLustre/failoverLustre.xml
>
> I get the following error from lconf:
>
>   NameError: global name 'find_local_servers' is not defined

The recovery won't work since you reformatted your MDS. You should also not specify your own recovery upcalls and leave the defaults; the client will automatically attempt to reconnect to the failover server. You shouldn't need to run a lconf --recover.

You can make sure your MDS is running with cat /proc/fs/lustre/devices - that should look the same on the primary and on the failover.
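[Following that advice, the generic client entries from the original script stay as they were, and the client keeps being brought up with --node client. A sketch, with the * quoted only so the shell does not expand it:]

  lmc -m failoverLustre.xml --add net --node client --nid '*' --nettype tcp
  lmc -m failoverLustre.xml --add mtpt --node client --path /mnt/lustre --mds ha-mds --lov lov-ts

  # Client bring-up, as in step 3 of the original procedure, but without
  # the custom --upcall, per the advice above:
  lconf -v --timeout 30 --node client failoverLustre.xml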
Eureka! It worked. I successfully failed over my MDS. Thanks to everyone who offered advice.

Nathaniel wrote:

> The recovery won't work since you reformatted your MDS. You should also
> not specify your own recovery upcalls and leave the defaults; the client
> will automatically attempt to reconnect to the failover server. You
> shouldn't need to run a lconf --recover.

I took both your suggestions. Thanks again.

By the way, the only reason I even considered using the upcall is because it suggests doing so in the manual (section 6.4.2). To quote:

  For example, one way to manage the current active node is to save the
  node name in a shared location that is accessible to the client upcall.
  When the upcall runs, it determines which service has failed, and looks
  up for the current active node to make the file system available. The
  current node and the upcall parameters are then passed to lconf in
  order to complete the recovery.
  . . .
  For example:

    $upcall FAILED_IMPORT ost1_UUID OSC_localhost_ost4_MNT_localhost NET_uml2_UUID ff151_lov1_6d1fce3b45
    lconf --recover --select ost1=nodeB --target_uuid $2 --client_uuid $3 --conn_uuid $4 <config.xml>

(I've CCed doc-request, with the hope that this can be fixed up in a future update.)

-Roger
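[For the archives: if someone does want an upcall along the lines of that manual example, here is a rough sketch adapted to the names used in this thread. The ha-mds=roger-ha-2 selection follows the manual's ost1=nodeB pattern, --tgt_uuid is the spelling used in the commands earlier in the thread, and the config path is an assumption; treat the whole thing as untested.]

  #!/bin/sh
  # Invoked by Lustre as:
  #   $0 FAILED_IMPORT <target_uuid> <client_uuid> <conn_uuid> <extra>
  echo `date` $0 "$@" >> /root/roger/upcall.log
  if [ "$1" = "FAILED_IMPORT" ]; then
      # Point the failed MDS service at the standby node and pass the
      # UUID arguments through, mirroring the manual's example.
      lconf --recover --select ha-mds=roger-ha-2 --tgt_uuid "$2" \
            --client_uuid "$3" --conn_uuid "$4" /root/roger/failoverLustre.xml
  fi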