Daniel McDonald
2010-May-26 19:53 UTC
[Ocfs2-users] Failover testing problem and a heartbeat question
We have a setup with 15 hosts fibre attached via a switch to a common SAN. Each host has a single fibre port; the SAN has two controllers, each with two ports. The SAN is exposing four OCFS2 v1.4.2 volumes. While performing a failover test, we observed 8 hosts fence and 2 reboot _without_ fencing. The OCFS2 FAQ recommends a default disk heartbeat of 31 - 61 loops for multipath io users. Our initial thought was to increase the default from 31 to 61.

I have two hopefully simple questions. First, is there any reason why we would not want to increase the threshold to 61? Performance or otherwise?

Second, is there any situation in which, during IO operations and while experiencing a single fibre path (out of 4) failure, an OCFS2 node would reset itself without _any_ kernel log message?

Thank you for your time
-Daniel
Sunil Mushran
2010-May-26 20:22 UTC
[Ocfs2-users] Failover testing problem and a heartbeat question
When a node dies, cluster operations pause until the node is first declared dead and then recovered. The threshold governs the time it takes to declare the node dead: the higher the value, the longer the pause.

ocfs2 does not reset without a log message. Do you have netconsole set up? Messages logged a tick before the reset can only be captured by netconsole, kdump, etc.
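
To put rough numbers on the pause, here is a minimal sketch of the threshold-to-timeout arithmetic. It assumes the relation given in the OCFS2 FAQ as I understand it, timeout = (threshold - 1) * 2 seconds, with a 2-second disk heartbeat interval; please check the FAQ for your release before relying on it. The function names are hypothetical and are not part of any ocfs2 tool.

```python
# Assumption: one disk heartbeat every 2 seconds, and a node is declared
# dead after (threshold - 1) missed intervals. Verify against the OCFS2
# FAQ for your version; the names below are illustrative only.
HEARTBEAT_INTERVAL_SECONDS = 2


def heartbeat_timeout_seconds(threshold: int) -> int:
    """Seconds before a silent node is declared dead for a given threshold."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS


def threshold_for_timeout(timeout_seconds: int) -> int:
    """Inverse mapping: threshold needed for a desired timeout in seconds."""
    if timeout_seconds < HEARTBEAT_INTERVAL_SECONDS:
        raise ValueError("timeout must be at least one heartbeat interval")
    return timeout_seconds // HEARTBEAT_INTERVAL_SECONDS + 1


if __name__ == "__main__":
    for threshold in (31, 61):
        print(f"threshold {threshold} -> {heartbeat_timeout_seconds(threshold)} s")
```

Under that assumption, moving from 31 to 61 roughly doubles the window from 60 to 120 seconds. That gives multipath more time to fail over, and it also means the rest of the cluster waits up to about two minutes before recovery starts when a node really has died.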