Hi,

I have a Heartbeat problem when trying automatic failover. Manual failover works great: unmounting a partition from one OSS and remounting it on another one lets the clients recover. With automatic failover, it all starts with this error:

Filesystem[7650]: 2011/12/22_14:36:05 ERROR: Couldn't mount filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
Filesystem[7639]: 2011/12/22_14:36:05 ERROR: Generic error

As a result, the failover OSS is the wrong one and the clients stay in this state forever:

sata-OST0000_UUID : Resource temporarily unavailable

Here is my Heartbeat config:

[root@ib3-st02 ~]# cat /etc/ha.d/ha.cf
# log file settings
# write debug output to /var/log/ha-debug
debugfile /var/log/ha-debug
# write log messages to /var/log/ha-log
logfile /var/log/ha-log
# use syslog to write to logfiles
logfacility local0
# set some time-outs. these values are only recommendations, which
# depend e.g. on the OSS load
# send keep-alive packages every 2 seconds
keepalive 2
# wait 90 seconds before declaring a node dead
deadtime 90
# write a warning to the logfile after 30 seconds without an answer
# from the failover node
warntime 30
# wait for 120 seconds before declaring a node dead after heartbeat
# is brought up
initdead 120
# define communication channels
# use port 12345 to communicate with fail-over node
udpport 12345
# use network interfaces eth0 and ib0 to detect a failed node
bcast eth0 bond0
# Use manual failback
auto_failback off
# node names in this failover-pair. These names must match the
# output of `hostname`
node ib3-st01
node ib3-st02
node ib3-st03
node ib3-st04

[root@ib3-st02 ~]# cat /etc/ha.d/haresources
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre

The configuration is the same on all OSSs.

Has anybody encountered this problem?
Thanks for your help.

--
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada

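
When checking which OSS is the preferred owner of a target such as clun60, it can help to list the Filesystem resources that haresources assigns to a given node. The following is only a minimal sketch, assuming the `node Filesystem::device::mountpoint::fstype` layout shown in the post; the function name `list_targets` is hypothetical and not part of Heartbeat.

```shell
#!/bin/sh
# list_targets FILE NODE
# Hypothetical helper (not a Heartbeat tool): print "device mountpoint" for
# every Filesystem resource that FILE (a haresources file) assigns to NODE.
list_targets() {
    awk -v node="$2" '
        $1 == node {
            # Each remaining field is one resource: Type::arg1::arg2::...
            for (i = 2; i <= NF; i++) {
                n = split($i, f, "::")
                if (f[1] == "Filesystem" && n >= 3)
                    print f[2], f[3]
            }
        }' "$1"
}
```

For example, `list_targets /etc/ha.d/haresources ib3-st04` would print the device and mount point of each target whose preferred node is ib3-st04, which can then be compared with what is actually mounted there.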
Hi,

we had the same problem. We 'fixed' it by increasing the start timeout in the Linux-HA script /usr/lib/ocf/resource.d/heartbeat/Filesystem:

...
<action name="start" timeout="300" />
...

If you use Pacemaker or the RH cluster suite (although your config dir looks like Linux-HA), there's probably a similar parameter.

Cheers

-Frank

On Thu, 2011-12-22 at 16:38 +0100, Patrice Hamelin wrote:
> I have a heartbeat problem while trying automatic failover. [...]
>
> Filesystem[7650]: 2011/12/22_14:36:05 ERROR: Couldn't mount
> filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
> Filesystem[7639]: 2011/12/22_14:36:05 ERROR: Generic error

------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------

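
Before and after making the change Frank describes, the start timeout currently declared in the agent can be read back from its `<action name="start" ... />` line. This is a rough sketch under the assumption that the line has the form quoted above; the function name `start_timeout` is hypothetical.

```shell
#!/bin/sh
# start_timeout FILE
# Hypothetical helper: print the timeout attribute of the
# <action name="start" ... /> element found in FILE (an OCF resource agent).
start_timeout() {
    sed -n 's/.*<action name="start"[^>]*timeout="\([^"]*\)".*/\1/p' "$1"
}
```

Running `start_timeout /usr/lib/ocf/resource.d/heartbeat/Filesystem` on each OSS shows whether the larger value is in place everywhere.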
Thanks Frank,

Works just great!

Greetings!

On 12/23/11 13:18, Frank Heckes wrote:
> we had the same problem. We 'fixed' it by increasing the start
> parameter in Linux-HA
> script /usr/lib/ocf/resource.d/heartbeat/Filesystem
>
> ...
> <action name="start" timeout="300" />
> ...

--
Patrice Hamelin