thr3ads.net - Lustre discuss - [Lustre-discuss] Heartbeat problem [Dec 2011]

Patrice Hamelin

2011-Dec-22 15:38 UTC

[Lustre-discuss] Heartbeat problem

Hi,

   I have a heartbeat problem when trying automatic failover. Manual
failover works great: unmounting a partition from one OSS and
remounting it on another makes the clients recover. It all starts
with this error:

Filesystem[7650]:       2011/12/22_14:36:05 ERROR: Couldn't mount filesystem /dev/mpath/colosse4-lun60-sata on /mnt/data/clun60
Filesystem[7639]:       2011/12/22_14:36:05 ERROR:  Generic error

   As a result, the failover lands on the wrong OSS and the clients stay
in this state forever:

sata-OST0000_UUID   : Resource temporarily unavailable

   Here is my heartbeat config:

[root@ib3-st02 ~]# cat /etc/ha.d/ha.cf
# log file settings
# write debug output to /var/log/ha-debug
debugfile /var/log/ha-debug
# write log messages to /var/log/ha-log
logfile /var/log/ha-log
# use syslog to write to logfiles
logfacility local0
# set some time-outs. these values are only recommendations, which
# depend e.g. on the OSS load
# send keep-alive packets every 2 seconds
keepalive 2
# wait 90 seconds before declaring a node dead
deadtime 90
# write a warning to the logfile after 30 seconds without an answer
# from the failover node
warntime 30
# wait for 120 seconds before declaring a node dead after heartbeat
# is brought up
initdead 120
# define communication channels
# use port 12345 to communicate with fail-over node
udpport 12345
# use network interfaces eth0 and bond0 to detect a failed node
bcast eth0 bond0
# Use manual failback
auto_failback off
# node names in this failover-pair. These names must match the
# output of `hostname`
node ib3-st01
node ib3-st02
node ib3-st03
node ib3-st04

[root@ib3-st02 ~]# cat /etc/ha.d/haresources
ib3-st01 Filesystem::/dev/emcssd-1/mdt-sata::/mnt/mdt-colosse::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun54-sata::/mnt/data/clun54::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun55-sata::/mnt/data/clun55::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun56-sata::/mnt/data/clun56::lustre
ib3-st01 Filesystem::/dev/mpath/colosse4-lun57-sata::/mnt/data/clun57::lustre
ib3-st02 Filesystem::/dev/mpath/colosse4-lun58-sata::/mnt/data/clun58::lustre
ib3-st03 Filesystem::/dev/mpath/colosse4-lun59-sata::/mnt/data/clun59::lustre
ib3-st04 Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre


   The configuration is the same on all OSSs.

Has anybody ever encountered this problem?
Thanks for your help.
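
To see at a glance which node each target prefers in a haresources
listing like the one above, the entries can be parsed into
(device, mount point, fstype) tuples per node. This is only a minimal
sketch: the function name parse_haresources and the two-entry sample are
assumptions made for illustration, and it understands only the
Filesystem::device::mountpoint::fstype form shown in this thread. Any
bare token is treated as a node name, so files that list other resource
types would need more care.

```python
from collections import defaultdict


def parse_haresources(text):
    """Map each preferred node to its Filesystem resources.

    Tolerates entries that were wrapped after the node name (as mail
    clients sometimes do). Hypothetical helper, not part of Heartbeat.
    """
    resources = defaultdict(list)
    node = None
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        for token in line.split():
            if "::" not in token:
                node = token  # assumption: a bare token is a node name
                continue
            agent, *args = token.split("::")
            if agent == "Filesystem" and len(args) >= 3 and node:
                device, mountpoint, fstype = args[:3]
                resources[node].append((device, mountpoint, fstype))
    return dict(resources)


# Two entries copied from the listing; the second is wrapped on purpose.
sample = """\
ib3-st01 Filesystem::/dev/mpath/colosse4-lun53-sata::/mnt/data/clun53::lustre
ib3-st04
Filesystem::/dev/mpath/colosse4-lun60-sata::/mnt/data/clun60::lustre
"""

for node, entries in parse_haresources(sample).items():
    for device, mountpoint, fstype in entries:
        print(f"{node}: {device} -> {mountpoint} ({fstype})")
```

Running the same function over the file from each OSS and comparing the
results is one way to confirm that the copies really are identical.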




-- 
Patrice Hamelin
Specialiste sénior en systèmes d'exploitation | Senior OS specialist
Environnement Canada | Environment Canada
2121, route Transcanadienne | 2121 Transcanada Highway
Dorval, QC H9P 1J3
Téléphone | Telephone 514-421-5303
Télécopieur | Facsimile 514-421-7231
Gouvernement du Canada | Government of Canada

Frank Heckes

2011-Dec-23 13:18 UTC

Hi,

We had the same problem. We 'fixed' it by increasing the timeout of the
start action in the Linux-HA script
/usr/lib/ocf/resource.d/heartbeat/Filesystem:

        ...
        <action name="start" timeout="300" />
        ...

If you use Pacemaker or Red Hat Cluster Suite (although your config
directory looks like Linux-HA), there's probably a similar parameter.

Cheers

-Frank
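
Before and after editing the agent, it can help to confirm which
timeouts it actually declares. The sketch below pulls the
<action ... timeout="..."/> values out of the script text with a regular
expression. The helper name action_timeouts is an assumption, and the
sample metadata is made up for illustration apart from the start value
of 300 quoted above. Check the real file on your own system rather than
relying on these numbers.

```python
import re
from pathlib import Path

# Matches e.g. <action name="start" timeout="300" /> and tolerates
# extra attributes between name and timeout, plus an optional "s" suffix.
ACTION_RE = re.compile(r'<action\s+name="([\w-]+)"[^>]*?\btimeout="(\d+)s?"')


def action_timeouts(agent_text):
    """Return {action name: timeout in seconds} found in the agent text."""
    return {name: int(secs) for name, secs in ACTION_RE.findall(agent_text)}


# Hypothetical metadata excerpt; only start=300 comes from this thread.
sample = """
<actions>
<action name="start" timeout="300" />
<action name="stop" timeout="60" />
<action name="monitor" depth="0" timeout="40" interval="20" />
</actions>
"""

print(action_timeouts(sample))

# On a node that has the agent installed, inspect the real script.
agent = Path("/usr/lib/ocf/resource.d/heartbeat/Filesystem")
if agent.exists():
    print(action_timeouts(agent.read_text(errors="replace")))
```

A start timeout that is shorter than the time a manual mount of the
target takes would be consistent with the mount error reported in this
thread.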



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDirig Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr. Achim Bachem (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

Patrice Hamelin

2011-Dec-23 13:49 UTC

Thanks Frank,

   Works just great!

Greetings!

