Junxiao Bi
2016-Jan-20 13:27 UTC
[Ocfs2-devel] ocfs2: o2hb: not fence self if storage down
Hi Joseph,

> On Jan 20, 2016, at 5:18 PM, Joseph Qi <joseph.qi at huawei.com> wrote:
>
> Hi Junxiao,
> Thanks for the patch set.
> In case only one node's storage link is down, if this node doesn't fence
> itself, other nodes will still check and mark this node dead, which will
> cause cluster membership inconsistency.
> In your patch set, I cannot see any logic to handle this. Am I missing
> something?

No, there is no logic for this. But why didn't the node fence itself when its storage went down? What keeps a softirq timer from running, another bug?

Thanks,
Junxiao.

> On 2016/1/20 11:13, Junxiao Bi wrote:
>> Hi,
>>
>> This series of patches fixes the issue that when storage goes down,
>> all nodes fence themselves due to write timeout.
>> With this patch set, all nodes will keep going until storage is back
>> online, unless one of the following happens, in which case all nodes
>> fence themselves as before:
>> 1. an I/O error is returned
>> 2. the network between nodes is down
>> 3. a node panics
>>
>> Junxiao Bi (6):
>>   ocfs2: o2hb: add negotiate timer
>>   ocfs2: o2hb: add NEGO_TIMEOUT message
>>   ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>   ocfs2: o2hb: add some user/debug log
>>   ocfs2: o2hb: don't negotiate if last hb fail
>>   ocfs2: o2hb: fix hb hung time
>>
>>  fs/ocfs2/cluster/heartbeat.c | 181 ++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 175 insertions(+), 6 deletions(-)
>>
>> Thanks,
>> Junxiao.
>>
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Hi Junxiao,

So you mean the negotiation you added only happens if the storage links of all nodes are down?

Thanks,
Joseph

On 2016/1/20 21:27, Junxiao Bi wrote:
> Hi Joseph,
>
>> On Jan 20, 2016, at 5:18 PM, Joseph Qi <joseph.qi at huawei.com> wrote:
>>
>> Hi Junxiao,
>> Thanks for the patch set.
>> In case only one node's storage link is down, if this node doesn't fence
>> itself, other nodes will still check and mark this node dead, which will
>> cause cluster membership inconsistency.
>> In your patch set, I cannot see any logic to handle this. Am I missing
>> something?
>
> No, there is no logic for this. But why didn't the node fence itself when its storage went down? What keeps a softirq timer from running, another bug?
>
> Thanks,
> Junxiao.
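
For readers following the thread, the behaviour described in the cover letter can be sketched roughly as below. This is only an illustration of the decision as discussed in these messages, not the code from fs/ocfs2/cluster/heartbeat.c. All names (hb_state, hb_decide, the flag fields) are hypothetical. The all_nodes_timed_out branch is an assumption drawn from the exchange above: Junxiao's reply implies that a node which alone loses storage is expected to fence itself as before, while Joseph's follow-up question about this is still open. A panicked node makes no decision at all, so that case is not modelled.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the code from heartbeat.c. */
enum hb_action {
    HB_KEEP_GOING,  /* keep running and wait for storage to come back */
    HB_FENCE_SELF   /* behave as before the patch set */
};

struct hb_state {
    bool write_timed_out;      /* heartbeat write did not complete in time */
    bool io_error;             /* the heartbeat I/O returned an error */
    bool network_down;         /* this node cannot reach its peers */
    bool all_nodes_timed_out;  /* assumption: every node sees the same timeout */
};

static enum hb_action hb_decide(const struct hb_state *s)
{
    /* Heartbeat is healthy: nothing to negotiate. */
    if (!s->write_timed_out)
        return HB_KEEP_GOING;

    /* Exceptions listed in the cover letter: fence as before. */
    if (s->io_error || s->network_down)
        return HB_FENCE_SELF;

    /* Assumption from the thread: a node that alone lost storage
     * still fences itself; the patch set adds no logic for it. */
    if (!s->all_nodes_timed_out)
        return HB_FENCE_SELF;

    /* Storage is down for everyone: keep going until it is back. */
    return HB_KEEP_GOING;
}
```

With this reading, a write timeout seen by every node with no I/O error and a working network is the only case where the nodes stay up; any other timeout falls back to self-fencing.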