Chadha, Narjit
2008-Feb-20 22:53 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
In short, I am working to fail over the MDT to another node. I have activated Heartbeat and it appears to be running properly. However, even if other resources fail over, the Lustre filesystem does not appear to. The mount point on lustre01 (head mdt) does not transfer to lustre02 (slave mdt) given a failure or a simple '/usr/lib/heartbeat/hb_takeover foreign' from the backup mdt server.

I am working with 2 nodes, both of which can see the same device, /dev/sdc1. I ensured that the device could be mounted by either server. The storage is Fibre Channel, if anybody is curious. Heartbeat was configured and set up as below.

/etc/ha.d/authkeys was set up (simple and the same on both servers). In /usr/lib/ocf/resource.d/heartbeat/Filesystem, I included the lustre filesystem as follows:

    if [ $blockdevice = "yes" ]; then
        if [ "$DEVICE" != "/dev/null" -a ! -b "$DEVICE" ] ; then
            ocf_log err "Couldn't find device [$DEVICE]. Expected /dev/??? to exist"
            exit $OCF_ERR_ARGS
        fi

        if
            case $FSTYPE in
                ext3|reiserfs|reiser4|lustre|nss|xfs|jfs|vfat|fat|nfs|cifs|smbfs|ocfs2) false;;
                *) true;;
            esac
        then
            ocf_log info "Starting filesystem check on $DEVICE"
            if [ -z "$FSTYPE" ]; then
                $FSCK -a $DEVICE
    ---etc

(this was the same on both servers)

Nothing was changed in /etc/ha.d/resource.d/Filesystem, as /usr/lib/ocf/resource.d/heartbeat/Filesystem was used instead.

/etc/ha.d/haresources contains the names and filesystems of the two servers. lustre01 is the primary mds server and lustre02 is the backup (same on both servers):

    lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
    lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre

The /etc/ha.d/ha.cf file on both servers is:

    debugfile /var/log/ha-debug
    logfile /var/log/ha-log
    logfacility local0
    keepalive 2
    deadtime 15
    initdead 60
    udpport 694
    bcast eth1
    auto_failback off
    node lustre01
    node lustre02

I have tried various orderings of starting heartbeat, but generally, I first format the lustre01 node using 'mkfs.lustre --mdt --mgs --fsname mylustre --failnode=lustre02@tcp --reformat /dev/sdc1'. This works fine. Following this step, I mount the primary node (as shown on p.76 of the Lustre 1.6 manual) with 'mount -t lustre /dev/sdc1 /lustremds'. /lustremds exists on both nodes. After this the 'service heartbeat start' command is issued on both nodes. The results are as follows:

Lustre01 (primary mdt)

heartbeat[3727]: 2008/02/20_16:30:34 info: **************************
heartbeat[3727]: 2008/02/20_16:30:34 info: Configuration validated. Starting heartbeat 2.1.2
heartbeat[3728]: 2008/02/20_16:30:34 info: heartbeat: version 2.1.2
heartbeat[3728]: 2008/02/20_16:30:34 info: Heartbeat generation: 1200690464
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_TriggerHandler: Added signal manual handler
heartbeat[3728]: 2008/02/20_16:30:34 info: Removing /var/run/heartbeat/rsctmp failed, recreating.
heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast heartbeat started on port 694 (694) interface eth1
heartbeat[3728]: 2008/02/20_16:30:34 info: glib: UDP Broadcast heartbeat closed on port 694 interface eth1 - Status: 1
heartbeat[3728]: 2008/02/20_16:30:34 info: G_main_add_SignalHandler: Added signal handler for signal 17
heartbeat[3728]: 2008/02/20_16:30:34 info: Local status now set to: 'up'
heartbeat[3728]: 2008/02/20_16:30:35 info: Link lustre01:eth1 up.
heartbeat[3728]: 2008/02/20_16:30:40 info: Link lustre02:eth1 up.
heartbeat[3728]: 2008/02/20_16:30:40 info: Status update for node lustre02: status up
harc[3735]: 2008/02/20_16:30:40 info: Running /etc/ha.d/rc.d/status status
heartbeat[3728]: 2008/02/20_16:30:41 info: Comm_now_up(): updating status to active
heartbeat[3728]: 2008/02/20_16:30:41 info: Local status now set to: 'active'
heartbeat[3728]: 2008/02/20_16:30:41 WARN: G_CH_dispatch_int: Dispatch function for read child took too long to execute: 210 ms (> 50 ms) (GSource: 0x8432df8)
heartbeat[3728]: 2008/02/20_16:30:41 info: Status update for node lustre02: status active
harc[3752]: 2008/02/20_16:30:41 info: Running /etc/ha.d/rc.d/status status
heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource transition completed.
heartbeat[3728]: 2008/02/20_16:30:52 info: remote resource transition completed.
heartbeat[3728]: 2008/02/20_16:30:52 info: Initial resource acquisition complete (T_RESOURCES(us))
IPaddr[3804]: 2008/02/20_16:30:52 INFO: Resource is stopped
heartbeat[3768]: 2008/02/20_16:30:52 info: Local Resource acquisition completed.
harc[3843]: 2008/02/20_16:30:52 info: Running /etc/ha.d/rc.d/ip-request-resp ip-request-resp
ip-request-resp[3843]: 2008/02/20_16:30:52 received ip-request-resp 192.168.100.1 OK yes
ResourceManager[3864]: 2008/02/20_16:30:52 info: Acquiring resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[3891]: 2008/02/20_16:30:52 INFO: Resource is stopped
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[3967]: 2008/02/20_16:30:52 INFO: Using calculated nic for 192.168.100.1: eth1
IPaddr[3967]: 2008/02/20_16:30:52 INFO: Using calculated netmask for 192.168.100.1: 255.255.255.0
IPaddr[3967]: 2008/02/20_16:30:52 INFO: eval ifconfig eth1:0 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[3967]: 2008/02/20_16:30:52 ERROR: Could not add 192.168.100.1 to eth1: 255
IPaddr[3950]: 2008/02/20_16:30:52 ERROR: Unknown error: 255
ResourceManager[3864]: 2008/02/20_16:30:52 ERROR: Return code 1 from /etc/ha.d/resource.d/IPaddr
ResourceManager[3864]: 2008/02/20_16:30:52 CRIT: Giving up resources due to failure of 192.168.100.1
ResourceManager[3864]: 2008/02/20_16:30:52 info: Releasing resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running /etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4103]: 2008/02/20_16:30:52 INFO: Running stop for /dev/sdc on /lustremds
Filesystem[4092]: 2008/02/20_16:30:52 INFO: Success
ResourceManager[3864]: 2008/02/20_16:30:52 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[4176]: 2008/02/20_16:30:52 INFO: Success
heartbeat[3728]: 2008/02/20_16:31:22 info: lustre02 wants to go standby [foreign]
hb_standby[4227]: 2008/02/20_16:31:23 Going standby [foreign].
heartbeat[3728]: 2008/02/20_16:31:23 WARN: Standby in progress- new request from lustre01 ignored [3600 seconds left]
heartbeat[3728]: 2008/02/20_16:31:23 info: standby: acquire [foreign] resources from lustre02
heartbeat[4241]: 2008/02/20_16:31:23 info: acquire local HA resources (standby).
ResourceManager[4254]: 2008/02/20_16:31:23 info: Acquiring resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[4281]: 2008/02/20_16:31:23 INFO: Resource is stopped
ResourceManager[4254]: 2008/02/20_16:31:23 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[4357]: 2008/02/20_16:31:23 INFO: Using calculated nic for 192.168.100.1: eth1
IPaddr[4357]: 2008/02/20_16:31:23 INFO: Using calculated netmask for 192.168.100.1: 255.255.255.0
IPaddr[4357]: 2008/02/20_16:31:23 INFO: eval ifconfig eth1:1 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[4357]: 2008/02/20_16:31:23 ERROR: Could not add 192.168.100.1 to eth1: 255
IPaddr[4340]: 2008/02/20_16:31:23 ERROR: Unknown error: 255
ResourceManager[4254]: 2008/02/20_16:31:24 ERROR: Return code 1 from /etc/ha.d/resource.d/IPaddr
ResourceManager[4254]: 2008/02/20_16:31:24 CRIT: Giving up resources due to failure of 192.168.100.1
ResourceManager[4254]: 2008/02/20_16:31:24 info: Releasing resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4254]: 2008/02/20_16:31:24 info: Running /etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4491]: 2008/02/20_16:31:24 INFO: Running stop for /dev/sdc on /lustremds
Filesystem[4480]: 2008/02/20_16:31:24 INFO: Success
ResourceManager[4254]: 2008/02/20_16:31:24 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[4564]: 2008/02/20_16:31:24 INFO: Success
heartbeat[4241]: 2008/02/20_16:31:24 info: local HA resource acquisition completed (standby).
heartbeat[3728]: 2008/02/20_16:31:24 info: Standby resource acquisition done [foreign].
heartbeat[3728]: 2008/02/20_16:31:24 info: remote resource transition completed.
hb_standby[4621]: 2008/02/20_16:31:54 Going standby [foreign].
heartbeat[3728]: 2008/02/20_16:31:54 info: lustre01 wants to go standby [foreign]
heartbeat[3728]: 2008/02/20_16:31:54 info: standby: lustre02 can take our foreign resources
heartbeat[4635]: 2008/02/20_16:31:54 info: give up foreign HA resources (standby).
ResourceManager[4648]: 2008/02/20_16:31:55 info: Releasing resource group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4648]: 2008/02/20_16:31:55 info: Running /etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[4697]: 2008/02/20_16:31:55 INFO: Running stop for /dev/sdc on /lustremds
Filesystem[4686]: 2008/02/20_16:31:55 INFO: Success
ResourceManager[4648]: 2008/02/20_16:31:55 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.2 stop
IPaddr[4770]: 2008/02/20_16:31:55 INFO: Success
heartbeat[4635]: 2008/02/20_16:31:55 info: foreign HA resource release completed (standby).
heartbeat[3728]: 2008/02/20_16:31:55 info: Local standby process completed [foreign].
heartbeat[3728]: 2008/02/20_16:31:56 WARN: 1 lost packet(s) for [lustre02] [58:60]
heartbeat[3728]: 2008/02/20_16:31:56 info: remote resource transition completed.
heartbeat[3728]: 2008/02/20_16:31:56 info: No pkts missing from lustre02!
heartbeat[3728]: 2008/02/20_16:31:56 info: Other node completed standby takeover of foreign resources.
heartbeat[3728]: 2008/02/20_16:32:26 info: lustre02 wants to go standby [foreign]
heartbeat[3728]: 2008/02/20_16:32:27 info: standby: acquire [foreign] resources from lustre02
heartbeat[4811]: 2008/02/20_16:32:27 info: acquire local HA resources (standby).
ResourceManager[4824]: 2008/02/20_16:32:27 info: Acquiring resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
IPaddr[4851]: 2008/02/20_16:32:27 INFO: Resource is stopped
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 start
IPaddr[4927]: 2008/02/20_16:32:27 INFO: Using calculated nic for 192.168.100.1: eth1
IPaddr[4927]: 2008/02/20_16:32:27 INFO: Using calculated netmask for 192.168.100.1: 255.255.255.0
IPaddr[4927]: 2008/02/20_16:32:27 INFO: eval ifconfig eth1:2 192.168.100.1 netmask 255.255.255.0 broadcast 192.168.100.255
IPaddr[4927]: 2008/02/20_16:32:27 ERROR: Could not add 192.168.100.1 to eth1: 255
IPaddr[4910]: 2008/02/20_16:32:27 ERROR: Unknown error: 255
ResourceManager[4824]: 2008/02/20_16:32:27 ERROR: Return code 1 from /etc/ha.d/resource.d/IPaddr
ResourceManager[4824]: 2008/02/20_16:32:27 CRIT: Giving up resources due to failure of 192.168.100.1
ResourceManager[4824]: 2008/02/20_16:32:27 info: Releasing resource group: lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running /etc/ha.d/resource.d/Filesystem /dev/sdc /lustremds lustre stop
Filesystem[5061]: 2008/02/20_16:32:27 INFO: Running stop for /dev/sdc on /lustremds
Filesystem[5050]: 2008/02/20_16:32:27 INFO: Success
ResourceManager[4824]: 2008/02/20_16:32:27 info: Running /etc/ha.d/resource.d/IPaddr 192.168.100.1 stop
IPaddr[5134]: 2008/02/20_16:32:27 INFO: Success
heartbeat[4811]: 2008/02/20_16:32:27 info: local HA resource acquisition completed (standby).
heartbeat[3728]: 2008/02/20_16:32:27 info: Standby resource acquisition done [foreign].
heartbeat[3728]: 2008/02/20_16:32:28 info: remote resource transition completed.

Lustre02 (secondary mdt)

heartbeat[4833]: 2008/02/20_16:39:24 info: lustre01 wants to go standby [foreign]
heartbeat[4833]: 2008/02/20_16:39:25 info: standby: acquire [foreign] resources from lustre01
heartbeat[6658]: 2008/02/20_16:39:25 info: acquire local HA resources (standby).
ResourceManager[6671]: 2008/02/20_16:39:25 info: Acquiring resource group: lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre

However, I cannot see Lustre mounted on either node. Does anybody know what the issue is here? This statement concerns me:

IPaddr[4357]: 2008/02/20_16:31:23 ERROR: Could not add 192.168.100.1 to eth1: 255
IPaddr[4340]: 2008/02/20_16:31:23 ERROR: Unknown error: 255

BTW, 192.168.100.1 is the eth1 address on lustre01 (main mdt).

Thanks
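For reference, a by-hand check that the shared MDT device really can be mounted from either server (independent of Heartbeat) might look like the following sketch; it assumes the device and mount point from the setup above, with only one node mounting the MDT at any time:

    # on lustre01
    mount -t lustre /dev/sdc1 /lustremds
    umount /lustremds

    # then on lustre02
    mount -t lustre /dev/sdc1 /lustremds
    umount /lustremds

If that sequence works on both nodes, the remaining failover problem is most likely in the Heartbeat resource configuration rather than in Lustre or the shared storage itself.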
Aaron Knister
2008-Feb-21 00:52 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
I've never used Heartbeat before, but just from reading what you wrote I see a couple of things that could be wrong. The first is in /etc/ha.d/haresources (pasted lines below):

> lustre01 192.168.100.1 Filesystem::/dev/sdc::/lustremds::lustre
> lustre02 192.168.100.2 Filesystem::/dev/sdc::/lustremds::lustre

If I understand this correctly, you're telling it to look for the filesystem on /dev/sdc, but your mkfs command created the filesystem on /dev/sdc1. I don't think that's the main problem here, however. I believe the problem is that you have 192.168.100.1 assigned to eth1 already. It's trying to re-assign it to eth1:0, which will cause it to fail. Try giving another IP to eth1.
Aaron Knister
Associate Systems Analyst
Center for Ocean-Land-Atmosphere Studies
(301) 595-7000
aaron at iges.org
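As a concrete illustration of the suggestion above, the haresources entry could name a spare address that is not already configured on either node as the floating service IP, and give the actual partition as the device. The address 192.168.100.10 below is only a placeholder for whatever free address exists on that subnet:

    lustre01 192.168.100.10 Filesystem::/dev/sdc1::/lustremds::lustre

With a dedicated floating IP, the IPaddr resource can bring the address up as eth1:0 on whichever node currently owns the resource group, instead of colliding with an address that eth1 already carries.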
Chadha, Narjit
2008-Feb-21 21:14 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Thanks,

From what I found out, a single dummy address must be used in /etc/ha.d/haresources. That got rid of the IP conflict. I seem to be able to fail over the MDS now.

The only thing left is to be able to mount the failover MDS configuration on the OSS. The syntax

    mkfs.lustre --ost -fsname=mylustre -mgsnid=lustre0[1-2] /dev/sdb1

gives an error when parsing the node names, yet this is what the Lustre manual indicates to do. A ',' separation of node names also does not work and will yield this type of error upon mounting:

    mount.lustre: mount /dev/sdb1 at /mnt/lustrefs failed: Input/output error
    Is the MGS running?

I have seen a number of people having the same problem, but have not seen a resolution posted yet.

Regards,
N.
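One way to narrow down the "Is the MGS running?" error is to confirm from the OSS that the MGS NIDs are actually reachable over LNET before attempting the mount. A quick check, assuming the node names resolve and tcp is the network type in use here, might be:

    # on the OSS
    lctl ping lustre01@tcp
    lctl ping lustre02@tcp

If neither NID answers, the mount failure points at LNET or network configuration rather than at the mkfs.lustre syntax.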
Cliff White
2008-Feb-21 21:50 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Aaron Knister wrote:
> If I understand this correctly, you're telling it to look for the
> filesystem on /dev/sdc, but your mkfs command created the filesystem
> on /dev/sdc1. I don't think that's the main problem here, however. I
> believe the problem is that you have 192.168.100.1 assigned to eth1
> already. It's trying to re-assign it to eth1:0, which will cause it to
> fail. Try giving another IP to eth1.

This is indeed the cause of the heartbeat failure. Also, the filesystem type is 'ldiskfs', not lustre; you need to change the Filesystem script, as it will fail the FSTYPE check.

cliffw
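As a sketch of the kind of change Cliff describes, assuming the check in question is the fsck-skip list in the OCF Filesystem agent quoted earlier in the thread, 'ldiskfs' can be added next to 'lustre' so that neither type triggers a filesystem check:

    case $FSTYPE in
        ext3|reiserfs|reiser4|ldiskfs|lustre|nss|xfs|jfs|vfat|fat|nfs|cifs|smbfs|ocfs2) false;;
        *) true;;
    esac

Which of 'lustre' or 'ldiskfs' belongs in the haresources fstype field depends on the Lustre version and on how the agent ends up invoking mount, so it is worth exercising the agent by hand (for example '/etc/ha.d/resource.d/Filesystem /dev/sdc1 /lustremds lustre start' and the matching stop) before relying on it for failover.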
Andreas Dilger
2008-Feb-23 01:07 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
On Feb 21, 2008 13:14 -0800, Chadha, Narjit wrote:
> The only thing left is to be able to mount the failover mds
> configuration on the OSS. The syntax:
>
> mkfs.lustre --ost -fsname=mylustre -mgsnid=lustre0[1-2] /dev/sdb1

I think this is a defect in the manual. It should be "--fsname" and
"--mgsnid", I believe. Please confirm that is the issue so the manual
can be updated.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
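With the double-dash spelling, the corrected command would presumably read something like the following (a sketch using the node name from earlier in the thread; how to express the failover MGS node is a separate question, discussed further below):

    mkfs.lustre --ost --fsname=mylustre --mgsnid=lustre01@tcp /dev/sdb1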
Chadha, Narjit
2008-Feb-23 22:56 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
I do not believe this is the issue, although it was written this way in my
training manual. I had used something to the effect of 'mkfs.lustre --mdt
--mgs --fsname mylustre --failnode=lustre02@tcp --reformat /dev/sdc1' when
formatting, with the double-dash options as you have them. The mkfs.lustre
command works properly with a single mdt, but it appears that for a failover
configuration the 'lustre0[1-2]' syntax is not accepted, and this needs to
be fixed in the manual once the correct syntax is known.

Regards,

Narjit
Chadha, Narjit
2008-Mar-06 20:39 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Would you know the correct MDS mount syntax (for OSTs and clients) for an
MDS failover? For the OSSs, it does not appear to take the form:

    mkfs.lustre --ost --fsname=lustrefs --mgsnid=mds[1-2] /dev/sdb1   **
    mount -t lustre /dev/sdb1 /mnt/lustre

where mds1 and mds2 are the mgsnids of the primary and failover MDSs. There
is a parsing error for mds[1-2] (marked **), but Lustre knows of both mds1
and mds2 independently.

For the clients, it does not appear to take the form:

    mount -t lustre mds[1-2]:/lustrefs /mnt/lustre   **

Lustre cannot parse mds[1-2] (again marked **), but knows of both mds1 and
mds2 independently.

As mentioned, I have tried comma-separated and colon-separated names as well,
with no effect. The problem is with the mgsnid name structure; the ** only
show where the errors occur. The MDS on its own fails over seamlessly and was
not mentioned above (it also has the same fsname).

Thanks

N.
Klaus Steden
2008-Mar-06 20:50 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Hi Narjit,

Note that '[]' notation is a shell construct ... depending on the shell, it
might get expanded different ways, or not at all, and subsequently mangled
by the time it gets to mkfs.lustre.

The help output for mkfs.lustre on my 1.6.x system also uses '--mgs' and
'--mgsnode', but there is no mention of a '--mgsnid' option.

I find for clarity I use statements like this when working with my Lustre FS
on the client side:

-- cut --
mount -t lustre hm0-0@tcp:hm0-1@tcp:/lustre /mnt/lustre
-- cut --

And on the OSS server side:

-- cut --
mount -t lustre /dev/sdi /mnt/lustreost
-- cut --

Check your command lines, I think they're slightly incorrect.

hth,
Klaus
Chadha, Narjit
2008-Mar-06 21:45 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Hi Klaus,

Actually, mgsnid and mgsnode appear to be interchangeable, but the results
are the same.

It is likely that the command lines being used are slightly incorrect in
that the [] syntax is getting mangled. I am using bash on Red Hat 5, though.

I wonder how the proper mount should look on the clients if I am using ':'
instead of '[1-2]' to designate the mgsnids (mgsnodes) on the OSSs.

Regards,

N.
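For the clients, a colon-separated mount along the lines of Klaus's example would presumably look like this (a sketch, reusing the node names and fsname from earlier in this thread):

    mount -t lustre lustre01@tcp:lustre02@tcp:/mylustre /mnt/lustre

The colon-separated NIDs before the fsname are the primary and failover MGS, which the client tries in order.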
Klaus Steden
2008-Mar-06 23:50 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Hi Narjit,

My usual syntax with mkfs.lustre/tunefs.lustre is something like this:

tunefs.lustre --erase-params --writeconf --failnode=tm0-1@tcp
--failnode=tm0-0@tcp --failnode=172.16.130.249@tcp1
--failnode=172.16.131.249@tcp2 --failnode=172.16.130.252@tcp1
--failnode=172.16.131.252@tcp2 /dev/sdd

i.e. I use more explicit notation. Specifying '--mgsnode=mgs1
--mgsnode=mgs2' will likely work as you expected.

If you're using bash, the '[]' expansion is usually only suited to filename
expansion, thusly:

-- cut --
[root@tiger ~]# echo mgs[1-2]
mgs[1-2]
[root@tiger ~]# touch mgs1 mgs2
[root@tiger ~]# echo mgs[1-2]
mgs1 mgs2
[root@tiger ~]# rm mgs[1-2]
rm: remove regular empty file `mgs1'? y
rm: remove regular empty file `mgs2'? y
[root@tiger ~]# touch mgs[1-2]
[root@tiger ~]# ls mgs\[1-2\]
mgs[1-2]
[root@tiger ~]# rm mgs\[1-2\]
rm: remove regular empty file `mgs[1-2]'? y
[root@tiger ~]# touch mgs{1,2}
[root@tiger ~]# ls mgs[1-2]
mgs1 mgs2
[root@tiger ~]#
-- cut --

Your mileage may vary, but again, using the full notation can save some
confusion over things like this.

cheers,
Klaus
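Applied to the OST in this thread, that suggestion would presumably become something like the following (a sketch, reusing the fsname, node names and device used earlier):

    mkfs.lustre --ost --fsname=mylustre --mgsnode=lustre01@tcp --mgsnode=lustre02@tcp --reformat /dev/sdb1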
Chadha, Narjit
2008-Mar-07 17:44 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Thanks for the assistance, Klaus. It appears that the primary and failover
mds servers must be separated by a ':' or ',' at least in the mkfs command
on the OSSs:

mkfs.lustre --ost --fsname=mylustre --mgsnode=lustre01:lustre02 --reformat /dev/sdb1

   Permanent disk data:
Target:     mylustre-OSTffff
Index:      unassigned
Lustre FS:  mylustre
Mount type: ldiskfs
Flags:      0x72
            (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.100.1@tcp,192.168.100.2@tcp

device size = 35000MB
formatting backing filesystem ldiskfs on /dev/sdb1
        target name  mylustre-OSTffff
        4k blocks    0
        options      -J size=400 -i 16384 -I 256 -q -O dir_index -F
mkfs_cmd = mkfs.ext2 -j -b 4096 -L mylustre-OSTffff -J size=400 -i 16384 -I 256 -q -O dir_index -F /dev/sdb1
Writing CONFIGS/mountdata

However, when trying to mount the filesystem on the OSS, the following
error occurs:

[root@lustre03 ~]# mount -t lustre /dev/sdb1 /mnt/lustrefs/
mount.lustre: mount /dev/sdb1 at /mnt/lustrefs failed: Input/output error
Is the MGS running?

The MGS is running on lustre01, but not on lustre02, as that is the failover
MGS/MDS node; the MGS will transfer over only as the result of a failover.
From my understanding of Lustre, there can only be one MGS/MDS active per
filesystem. Also, if I use the syntax '--mgsnode lustre01 --mgsnode lustre02',
this works on the OSSs, but a 'mount -t lustre /dev/sdb1 /mnt/lustrefs' on
the clients will freeze, meaning that the clients are being confused by the
two MDS nodes.

Regards,

N.
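Before digging further, it may be worth confirming that the OSS can actually reach the MGS NID recorded above over LNET; a minimal check (assuming the Lustre modules are loaded on the OSS) is:

    lctl ping 192.168.100.1@tcp

If that fails while the MGS is mounted on lustre01, the "Is the MGS running?" error is more likely a networking/NID problem than a formatting one.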
Klaus Steden
2008-Mar-07 19:15 UTC
[Lustre-discuss] MDT Failover not functioning properly with Lustre FS
Narjit,

I've found with my failover setup here that when I change the attributes of
my file system (i.e. what mkfs.lustre or tunefs.lustre would do), in order
to get my OSTs to mount, I have to mount the MDT on both servers.

Try mounting the MDT on the primary MDS, unmounting, mounting on the
secondary, unmounting, and then mounting again on the primary before you
try to mount any of your OSTs.

I think there must be some sort of cache effect that happens when doing
this, and once it's done, you don't have to jump through that hoop again
unless you alter the settings of your file system (i.e. add another network,
etc.). I don't know if this is expected behaviour, but it seems like a minor
nuisance rather than a bug.

hth,
Klaus
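As a concrete sequence, that suggestion would look something like the following (a sketch, assuming the shared MDT device and mount points used earlier in this thread):

    # on lustre01 (primary MDS)
    mount -t lustre /dev/sdc1 /lustremds
    umount /lustremds

    # on lustre02 (failover MDS)
    mount -t lustre /dev/sdc1 /lustremds
    umount /lustremds

    # back on lustre01
    mount -t lustre /dev/sdc1 /lustremds

    # then mount the OST on the OSS
    mount -t lustre /dev/sdb1 /mnt/lustrefs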