Hi,

Just wanted to understand how OCFS2 fencing works. Sorry if this has
already been discussed...

(1)
--------------
A node has quorum when:
* it sees an odd number of heartbeating nodes and has network
  connectivity to more than half of them.
or
* it sees an even number of heartbeating nodes and has network
  connectivity to at least half of them *and* has connectivity to
  the heartbeating node with the lowest node number.
--------------

Now, think about a case where there are 5 nodes in an OCFS2 cluster.
Suppose a split-brain happens and the cluster is divided into two
subclusters of 3 nodes and 2 nodes. In this case the algorithm works
fine and the 3-node subcluster wins the race. But consider a serial
split-brain, where two splits leave you with 2-node, 2-node and 1-node
subclusters (3 subclusters). Here the algorithm fails and every
subcluster gets panicked, because in each subcluster no node has
network connectivity to more than (5/2 = 2) nodes, while every node
still sees disk heartbeats from all 5 nodes. (A small sketch applying
the rule to both splits follows at the end of this mail.)

This can happen with any cluster configuration if there are serial
split-brains. Has the algorithm been designed to handle serial
split-brains? If yes, then how? Is there anything else to be
considered?

(2) In ocfs2_faq I read that it may take up to 28 seconds for the
quorum process to stabilize.
--------------
Q05 How long does the quorum process take?
A05 First a node will realize that it doesn't have connectivity with
    another node. This can happen immediately if the connection is
    closed but can take a maximum of 10 seconds of idle time. Then the
    node must wait long enough to give heartbeating a chance to declare
    the node dead. It does this by waiting two iterations longer than
    the number of iterations needed to consider a node dead (see Q03 in
    the Heartbeat section of this FAQ). The current default of 7
    iterations of 2 seconds results in waiting for 9 iterations or 18
    seconds. By default, then, a maximum of 28 seconds can pass from
    the time a network fault occurs until a node fences itself.
--------------

I don't understand why we give heartbeating two extra iterations to
declare a node dead in a split-brain. My understanding is: if we are
already missing disk heartbeats for a node, then its missed-heartbeat
counter has already started and we would declare that node dead after
7 iterations. How do the extra 2 iterations come in?

What I mean is: after the 10-second TCP idle timeout for a node, we
expect to start missing disk heartbeats for that node and we allow 9
iterations of such missed heartbeats. But how do you tell the other
thread, which is already doing the missed-heartbeat counting (because
we are missing disk heartbeats), that it needs to wait 2 more
iterations before declaring the node dead? If that thread is not told,
it will declare the other node dead after only 7 iterations. So where
do the extra 2 iterations come into the picture?

Thanks.
Sumsha.
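To make the two scenarios concrete, here is a minimal C sketch of the
quoted rule applied to both splits. It is not the actual o2quo code;
the has_quorum() helper and the convention of counting the node itself
in the connectivity figure are assumptions made only for illustration.

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper expressing the quorum rule quoted above.
 * hb   = number of nodes seen heartbeating on disk
 * conn = number of those nodes reachable over the network, counting
 *        this node itself (an assumption for this sketch)
 * conn_to_lowest = network link to the lowest-numbered heartbeating
 *        node (only matters when hb is even)
 */
static bool has_quorum(int hb, int conn, bool conn_to_lowest)
{
    if (hb % 2)
        return conn > hb / 2;               /* odd: strictly more than half */
    return conn >= hb / 2 && conn_to_lowest; /* even: half plus the tiebreak */
}

int main(void)
{
    /* 5 nodes split 3 + 2: only the 3-node side keeps quorum */
    printf("3-node side: %d\n", has_quorum(5, 3, false));  /* 1 */
    printf("2-node side: %d\n", has_quorum(5, 2, false));  /* 0 */

    /* 5 nodes split 2 + 2 + 1: every fragment loses quorum */
    printf("2-node frag: %d\n", has_quorum(5, 2, false));  /* 0 */
    printf("1-node frag: %d\n", has_quorum(5, 1, false));  /* 0 */
    return 0;
}

With hb stuck at 5 (disk heartbeats keep arriving) and conn at most 2
in any fragment, the check fails everywhere, which is exactly the
complete shutdown described in the question.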
With some experiments and by going through OCFS2's quorum code, I am
fairly sure that in a serial split-brain the quorum algorithm will
break and cause a complete cluster shutdown: every node in every
subcluster will panic itself. Please correct me if I am wrong...

o2quo_make_decision(), the function responsible for taking the final
decision on hb_up and hb_down events, makes a number of assumptions
which may fail, so it may take the wrong decision in serial split-brain
cases. Probably this problem will be resolved once "we have some more
rational approach that is driven from userspace", as mentioned in
quorum.c. (A toy model of that decision step is sketched below.)

Thanks.
Sumsha.

On 5/17/06, Sum Sha <sumsha.matrixreloaded at gmail.com> wrote:
> Hi,
> Just wanted to understand how OCFS2 fencing works. Sorry if this has
> already been discussed...
> [...]
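As a rough model of that decision step: the function name
o2quo_make_decision() is real, but the struct, helpers and logic below
are only a toy built on the quoted rule, not the kernel code. It shows
one node's view of the serial split, re-evaluating quorum on every
heartbeat or connectivity event and fencing when the check fails.

#include <stdbool.h>
#include <stdio.h>

/* Toy per-node quorum state, loosely modelled on the idea of a
 * per-node view; names here are invented for the example. */
struct quo_state {
    int  heartbeating;   /* nodes seen alive via disk heartbeat  */
    int  connected;      /* nodes reachable over the network     */
    bool sees_lowest;    /* network link to lowest-numbered node */
};

/* Same rule as the sketch earlier in the thread. */
static bool quorate(const struct quo_state *qs)
{
    if (qs->heartbeating % 2)
        return qs->connected > qs->heartbeating / 2;
    return qs->connected >= qs->heartbeating / 2 && qs->sees_lowest;
}

/* Called on every hb_up/hb_down/conn_down style event. */
static void make_decision(const struct quo_state *qs, const char *event)
{
    printf("%-26s -> %s\n", event,
           quorate(qs) ? "keep running" : "fence self");
}

int main(void)
{
    /* Node 3's view in a 5-node cluster (nodes 0..4), all connected. */
    struct quo_state qs = { .heartbeating = 5, .connected = 5,
                            .sees_lowest = true };

    make_decision(&qs, "steady state");

    /* First split: nodes 0 and 1 unreachable; disk heartbeat still seen. */
    qs.connected = 3;
    qs.sees_lowest = false;
    make_decision(&qs, "after first split (3+2)");

    /* Second split: node 2 also unreachable -> fragments are 2+2+1. */
    qs.connected = 2;
    make_decision(&qs, "after second split (2+2+1)");
    return 0;
}

In this model the second split pushes every fragment below the
threshold while disk heartbeats keep the heartbeating count at 5, so
every node decides to fence, matching the complete-shutdown concern.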
Sum Sha wrote:
> --------------
> Q05 How long does the quorum process take?
> A05 First a node will realize that it doesn't have connectivity with
>     another node. This can happen immediately if the connection is
>     closed but can take a maximum of 10 seconds of idle time. Then the
>     node must wait long enough to give heartbeating a chance to declare
>     the node dead. It does this by waiting two iterations longer than
>     the number of iterations needed to consider a node dead (see Q03 in
>     the Heartbeat section of this FAQ). The current default of 7
>     iterations of 2 seconds results in waiting for 9 iterations or 18
>     seconds. By default, then, a maximum of 28 seconds can pass from
>     the time a network fault occurs until a node fences itself.
> --------------
>
> I don't understand why we give heartbeating two extra iterations to
> declare a node dead in a split-brain. My understanding is: if we are
> already missing disk heartbeats for a node, then its missed-heartbeat
> counter has already started and we would declare that node dead after
> 7 iterations. How do the extra 2 iterations come in?

While working on the fencing harness RFC I realized why that extra wait
is necessary. Heartbeat will continue pinging a node for some number of
periods even while it receives no responses from that node. The trouble
is, the remote node may be receiving the pings and answering them, but
the answers are getting lost somewhere along the route back, so the
remote node does not yet know it is incommunicado. It is only when
heartbeat gives up and stops pinging that the remote node is sure to
start its watchdog count.

Given:

   A = number of missed answers before heartbeat stops pinging
   B = number of missed pings before the watchdog triggers
   H = heartbeat period
   L = maximum network latency within some confidence factor
   W = maximum latency between watchdog trigger and shutdown

the time to declare a node dead is:

   H(A + B) + 2L + W

so with:

   A = 2
   B = 2
   H = 2 seconds
   L = .5 seconds
   W = 10 seconds

we have:

   8 + 1 + 10 = 19 seconds

Network latency includes the maximum time to notice a ping and respond
to it, and the time required for heartbeat to notice the answer. There
is no need to incorporate a safety factor because allowing more than
one missed ping is already a safety factor. Did I miss anything in my
bookkeeping? I did not check to see if OCFS2's heartbeat obeys this
formula.

Unfortunately, it is difficult to establish dependable bounds for
network latency, so heartbeating is really a game of probabilities. We
should set the safety factor high enough that false positives do not
cost more downtime than would be saved by shorter timeouts.

Now, if we use a storage-side fencing method instead of a watchdog we
can set B and W to zero, giving 5 seconds using the example above. This
is more than three times better and shows why we need a proper fencing
harness sooner rather than later.

Regards,

Daniel
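A quick arithmetic check of the formula, using only the symbols defined
in the mail above; nothing here comes from the OCFS2 sources.

#include <stdio.h>

int main(void)
{
    double A = 2;     /* missed answers before heartbeat stops pinging     */
    double B = 2;     /* missed pings before the watchdog triggers         */
    double H = 2.0;   /* heartbeat period, seconds                         */
    double L = 0.5;   /* max network latency, seconds                      */
    double W = 10.0;  /* watchdog trigger to shutdown, seconds             */

    /* H*(A + B) + 2*L + W = 8 + 1 + 10 = 19 seconds */
    printf("watchdog fencing:     %.0f s\n", H * (A + B) + 2 * L + W);

    /* storage-side fencing: B and W drop out -> 4 + 1 = 5 seconds */
    printf("storage-side fencing: %.0f s\n", H * A + 2 * L);
    return 0;
}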
Daniel Phillips wrote:
> so with:
>
>    A = 2
>    B = 2
>    H = 2 seconds
>    L = .5 seconds
>    W = 10 seconds
>
> we have:
>
>    8 + 1 + 10 = 19 seconds

Oops, sorry, I should not have set W (the maximum latency between
watchdog trigger and shutdown) as high as 10 seconds, since the panic
shuts down interrupts much faster than that, which in theory stops any
disk or network hardware from transmitting. That should take only a few
ms; however, there may be write traffic in flight, so we need to set W
to a second or two. This gets our "safe" watchdog wait down to 10
seconds or so, which is still twice as bad as the 5 seconds we get with
storage-side fencing.

Regards,

Daniel
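Plugging the corrected W into the same arithmetic sketch (again only
the mail's symbols, with W assumed to be about one second per the
correction above) gives the revised figures.

#include <stdio.h>

int main(void)
{
    double A = 2, B = 2, H = 2.0, L = 0.5;
    double W = 1.0;   /* revised: roughly a second or two */

    /* 8 + 1 + 1 = 10 seconds, versus 5 for storage-side fencing */
    printf("revised watchdog wait: %.0f s\n", H * (A + B) + 2 * L + W);
    printf("storage-side fencing:  %.0f s\n", H * A + 2 * L);
    return 0;
}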