thr3ads.net - Ocfs2 users - [Ocfs2-users] Failover testing problem and a heartbeat question [May 2010]

Daniel McDonald

2010-May-26 19:53 UTC

[Ocfs2-users] Failover testing problem and a heartbeat question

We have a setup with 15 hosts fibre attached via a switch to a common SAN. Each
host has a single fibre port, the SAN has two controllers each with two ports.
The SAN is exposing four OCFS2 v1.4.2 volumes. While performing a failover test,
we observed 8 hosts fence and 2 reboot _without_ fencing. The OCFS2 FAQ
recommends a default disk heartbeat of 31 - 61 loops for multipath io users. Our
initial thought was to increase the default from 31 to 61.

I have two hopefully simple questions. First, is there any reason why we would
not want to increase the threshold to 61, performance or otherwise?

Second, is there any situation in which an OCFS2 node, during IO operations and
after a single fibre path (out of 4) fails, would reset itself without _any_
kernel log message?

Thank you for your time
-Daniel

Sunil Mushran

2010-May-26 20:22 UTC

When a node dies, cluster operations pause until the node is first
declared dead and recovery has then completed. The threshold governs
the time it takes to declare the node dead. The higher the value, the
longer the pause.
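
The trade-off can be made concrete with a little arithmetic. This is a minimal sketch, assuming the relationship given in the OCFS2 FAQ (a disk heartbeat every 2 seconds, and threshold = timeout in seconds / 2 + 1); the function names are illustrative and not part of any OCFS2 tool.

```python
# Assumption: the o2cb disk heartbeat ticks every 2 seconds and the FAQ
# formula threshold = (timeout_secs / 2) + 1 applies to this version.
HEARTBEAT_INTERVAL_SECS = 2


def dead_timeout_secs(threshold: int) -> int:
    """Seconds before an unresponsive node is declared dead."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECS


def threshold_for_timeout(timeout_secs: int) -> int:
    """Heartbeat threshold needed to tolerate a stall of timeout_secs."""
    if timeout_secs < 0:
        raise ValueError("timeout must be non-negative")
    return timeout_secs // HEARTBEAT_INTERVAL_SECS + 1


def compare(thresholds=(31, 61)) -> dict:
    """Map each candidate threshold to its dead-node timeout in seconds."""
    return {t: dead_timeout_secs(t) for t in thresholds}
```

Under these assumptions, `compare()` maps a threshold of 31 to 60 seconds and 61 to 120 seconds, so raising the value doubles the time the surviving nodes wait before recovery can begin.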

ocfs2 does not reset without a log message. Do you have netconsole
set up? Messages logged a tick before the reset can only be captured
by netconsole, kdump, etc.
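
netconsole sends kernel log lines as UDP datagrams to a remote host, so any UDP listener on a separate machine can capture them. The following is a minimal sketch of such a listener, assuming the nodes are configured to send to port 6666 (the default netconsole target port); the function name and parameters are hypothetical, and in practice a tool such as a syslog daemon or netcat would usually fill this role.

```python
import socket


def receive_messages(bind_addr="0.0.0.0", port=6666,
                     max_messages=None, timeout=None):
    """Yield (sender_ip, text) for each UDP datagram received.

    Assumption: each datagram carries one or more kernel log lines as
    plain text, which is how netconsole sends them by default.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind((bind_addr, port))
        sock.settimeout(timeout)  # None means block indefinitely
        received = 0
        while max_messages is None or received < max_messages:
            try:
                data, (host, _src_port) = sock.recvfrom(65535)
            except socket.timeout:
                return  # stop quietly when nothing arrives in time
            received += 1
            text = data.decode("utf-8", errors="replace").rstrip("\n")
            yield host, text
```

Running `for host, line in receive_messages(): print(host, line, flush=True)` on a log host that stays up during the failover test would show the last messages a node emitted before it reset.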

On 05/26/2010 12:53 PM, Daniel McDonald wrote:
> We have a setup with 15 hosts fibre attached via a switch to a common SAN.
> Each host has a single fibre port, the SAN has two controllers each with two
> ports. The SAN is exposing four OCFS2 v1.4.2 volumes. While performing a
> failover test, we observed 8 hosts fence and 2 reboot _without_ fencing. The
> OCFS2 FAQ recommends a default disk heartbeat of 31 - 61 loops for multipath
> io users. Our initial thought was to increase the default from 31 to 61.
>
> I have two hopefully simple questions. First, is there any reason why we
> would not want to increase the threshold to 61, performance or otherwise?
>
> Second, is there any situation in which an OCFS2 node, during IO operations
> and after a single fibre path (out of 4) fails, would reset itself without
> _any_ kernel log message?
>
> Thank you for your time
> -Daniel
