I am hoping we will not have this problem with the new
cluster stacks that are in development, cman and pacemaker.
But it is always good to hear about the issues users are
encountering. Well, not good... but you know what I mean.
Michael Moody wrote:
> I have a suggestion about the heartbeat and the way that "downed node"
> detection works.
>
> There are occasions where a node is up, and for whatever reason, I
> need to power cycle it (for instance, a frozen process, etc). In these
> instances, my other nodes are unable to perform file system operations
> until the heartbeat period expires. This ends up being somewhere
> around 30-60 seconds (this is the value which works best for me, and
> does not cause self fencing). It would be useful to allow me to force
> the remaining nodes to just understand the node was taken down
> purposefully, and move on with their lives.
>
> A real world example:
>
> OCFS2 hosting files used with a website, driven by Apache. If a node
> goes down, the load average on all remaining nodes skyrockets to 500
> or more, as the Apache processes all enter a state of uninterruptible
> sleep. This triggers alerts, pages, and on occasion, application
> specific triggers (web app) that show a "Too Busy" page when the load
> average is too high (for instance, that which vBulletin does).
>
> It would be magnificent to be able to instruct the remaining nodes
> that the node in question was taken down purposefully, and to go on
> about their lives immediately (beginning of course with the journal
> replay, etc).
>
> It's very simple in concept, and probably also execution. Could
> something like this be added? It would allow me to really do wonderful
> things from a STONITH perspective.
>
> Thanks,
>
> Michael