Junxiao Bi
2016-Jan-21 01:48 UTC
[Ocfs2-devel] ocfs2: o2hb: not fence self if storage down
On 01/21/2016 08:46 AM, Joseph Qi wrote:
> Hi Junxiao,
> So you mean the negotiation you added only happens if all nodes'
> storage links are down?
Negotiation starts when one node finds its storage link down, but it
succeeds only when the storage links of all nodes are down; otherwise
the behavior stays the same as before.

Thanks,
Junxiao.
>
> Thanks,
> Joseph
>
> On 2016/1/20 21:27, Junxiao Bi wrote:
>> Hi Joseph,
>>
>>> On 2016-01-20 at 5:18, Joseph Qi <joseph.qi at huawei.com> wrote:
>>>
>>> Hi Junxiao,
>>> Thanks for the patch set.
>>> If only one node's storage link is down and this node doesn't fence
>>> itself, other nodes will still check and mark this node dead, which
>>> will cause cluster membership inconsistency.
>>> In your patch set, I cannot see any logic to handle this. Am I missing
>>> something?
>> No, there is no logic for this. But why wouldn't the node fence itself
>> when storage is down? What would keep a softirq timer from running,
>> another bug?
>>
>> Thanks,
>> Junxiao.
>>>
>>> On 2016/1/20 11:13, Junxiao Bi wrote:
>>>> Hi,
>>>>
>>>> This series of patches fixes the issue that when storage goes down,
>>>> all nodes fence themselves due to write timeout.
>>>> With this patch set, all nodes keep going until storage is back
>>>> online, unless one of the following happens, in which case all nodes
>>>> fence themselves as before:
>>>> 1. an io error is returned
>>>> 2. the network between nodes goes down
>>>> 3. nodes panic
>>>>
>>>> Junxiao Bi (6):
>>>>   ocfs2: o2hb: add negotiate timer
>>>>   ocfs2: o2hb: add NEGO_TIMEOUT message
>>>>   ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>>>   ocfs2: o2hb: add some user/debug log
>>>>   ocfs2: o2hb: don't negotiate if last hb fail
>>>>   ocfs2: o2hb: fix hb hung time
>>>>
>>>>  fs/ocfs2/cluster/heartbeat.c | 181 ++++++++++++++++++++++++++++++++++++++++--
>>>>  1 file changed, 175 insertions(+), 6 deletions(-)
>>>>
>>>> Thanks,
>>>> Junxiao.
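The rule described in the message above (negotiation starts on one node, succeeds only if every node has lost storage, and falls back to fencing on io error, network loss, or panic) can be sketched roughly as follows. This is only an illustration of the decision as stated in the thread, not the code in fs/ocfs2/cluster/heartbeat.c; the names `Peer`, `Action`, and `decide_on_write_timeout` are hypothetical, and a panicked or network-isolated peer is simply modeled as unreachable.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Action(Enum):
    KEEP_RUNNING = "keep running"
    FENCE_SELF = "fence self"


@dataclass
class Peer:
    """Hypothetical view of another node; not a structure from heartbeat.c."""
    reachable: bool          # False if the network is down or the peer panicked
    storage_link_down: bool  # True if the peer also reports its storage as down


def decide_on_write_timeout(local_io_error: bool, peers: Iterable[Peer]) -> Action:
    """Decide what a node does once its heartbeat write has timed out.

    Assumption: this is called only after the local node has already
    found its own storage link down.
    """
    # An io error (as opposed to a hung write) keeps the old behavior.
    if local_io_error:
        return Action.FENCE_SELF
    # Negotiation succeeds only if every peer answers and every peer
    # has lost storage as well.
    for peer in peers:
        if not peer.reachable or not peer.storage_link_down:
            return Action.FENCE_SELF
    return Action.KEEP_RUNNING
```

Under these assumptions, a node whose peers are all reachable and all report storage down keeps running, while a single healthy or unreachable peer is enough to fall back to self-fencing, which matches the "same behavior as before" case in the reply.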
Hi Junxiao,

On 2016/1/21 9:48, Junxiao Bi wrote:
> On 01/21/2016 08:46 AM, Joseph Qi wrote:
>> Hi Junxiao,
>> So you mean the negotiation you added only happens if all nodes'
>> storage links are down?
> Negotiation starts when one node finds its storage link down, but it
> succeeds only when the storage links of all nodes are down; otherwise
> the behavior stays the same as before.
IC, thanks for your explanation.
IMHO, if storage is down, all business deployed on the storage will be
impacted even if the nodes don't fence.
I have another scenario: only several paths (in a multipath environment)
on several nodes have problems, and as a result ocfs2 fences these
nodes. So I wonder if we have a better way to resolve this issue.

Thanks,
Joseph