thr3ads.net - Ocfs2 devel - [Ocfs2-devel] ocfs2: o2hb: not fence self if storage down [Jan 2016]

Joseph Qi

2016-Jan-20 09:18 UTC

[Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

Hi Junxiao,
Thanks for the patch set.
If only one node's storage link goes down and that node doesn't fence
itself, the other nodes will still check and mark it dead, which will
cause cluster membership inconsistency.
In your patch set, I cannot see any logic to handle this. Am I missing
something?
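
To make the concern concrete, here is a minimal sketch in C of the
scenario, not code from the patch set. It assumes a simplified model in
which each node bumps a sequence number in its own slot on shared disk
and declares a peer dead after DEAD_THRESHOLD reads with no change. The
names node_view, hb_round, simulate and DEAD_THRESHOLD are hypothetical
and do not come from heartbeat.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NR_NODES       3
#define DEAD_THRESHOLD 3   /* hypothetical: unchanged reads before a peer is declared dead */

struct node_view {
    bool storage_up;               /* can this node reach the shared disk? */
    unsigned long seq;             /* last sequence number it wrote */
    unsigned long seen[NR_NODES];  /* last sequence number read from each slot */
    int missed[NR_NODES];          /* consecutive reads with no change */
    bool alive[NR_NODES];          /* this node's membership view */
};

/* One heartbeat round: nodes with storage write their slot, then read all slots. */
void hb_round(struct node_view *nodes, unsigned long *slots)
{
    for (int i = 0; i < NR_NODES; i++)
        if (nodes[i].storage_up)
            slots[i] = ++nodes[i].seq;

    for (int i = 0; i < NR_NODES; i++) {
        if (!nodes[i].storage_up)
            continue;   /* cannot read the disk, so its view stays frozen */
        for (int j = 0; j < NR_NODES; j++) {
            if (slots[j] != nodes[i].seen[j]) {
                nodes[i].seen[j] = slots[j];
                nodes[i].missed[j] = 0;
                nodes[i].alive[j] = true;
            } else if (++nodes[i].missed[j] >= DEAD_THRESHOLD) {
                nodes[i].alive[j] = false;
            }
        }
    }
}

/* Run `rounds` rounds after `down_node` loses its storage link without fencing. */
void simulate(struct node_view *nodes, int down_node, int rounds)
{
    unsigned long slots[NR_NODES] = {0};

    assert(down_node >= 0 && down_node < NR_NODES);
    memset(nodes, 0, sizeof(*nodes) * NR_NODES);
    for (int i = 0; i < NR_NODES; i++)
        nodes[i].storage_up = true;

    hb_round(nodes, slots);               /* every node sees every other node */
    nodes[down_node].storage_up = false;  /* link drops; the node keeps running */
    for (int r = 0; r < rounds; r++)
        hb_round(nodes, slots);
}
```

Calling simulate(nodes, 0, DEAD_THRESHOLD) leaves nodes 1 and 2 with
alive[0] == false, while node 0 still has every entry of its own alive[]
set to true. That disagreement is the membership inconsistency in
question.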

On 2016/1/20 11:13, Junxiao Bi wrote:
> Hi,
> 
> This series of patches fixes the issue that when storage goes down,
> all nodes fence themselves due to the write timeout.
> With this patch set, all nodes keep going until storage is back
> online, unless one of the following happens, in which case all nodes
> fence themselves as before:
> 1. an I/O error is returned
> 2. the network between nodes goes down
> 3. a node panics
> 
> Junxiao Bi (6):
>       ocfs2: o2hb: add negotiate timer
>       ocfs2: o2hb: add NEGO_TIMEOUT message
>       ocfs2: o2hb: add NEGOTIATE_APPROVE message
>       ocfs2: o2hb: add some user/debug log
>       ocfs2: o2hb: don't negotiate if last hb fail
>       ocfs2: o2hb: fix hb hung time
> 
>  fs/ocfs2/cluster/heartbeat.c |  181 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 175 insertions(+), 6 deletions(-)
> 
>  Thanks,
>  Junxiao.
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
>
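
The behaviour described in the cover letter above can be read as a
decision taken when a heartbeat write times out. The following is a
sketch of that reading in C, not the patch's code: hb_state, hb_decide
and the enum values are hypothetical names, and the all_peers_timed_out
flag is an assumption about how "storage down" for the whole cluster
would be detected. The cover letter does not spell that out.

```c
#include <assert.h>
#include <stdbool.h>

enum hb_action { HB_CONTINUE, HB_FENCE_SELF };

struct hb_state {
    bool write_timed_out;      /* heartbeat write did not complete in time */
    bool io_error;             /* exception 1: the write returned an I/O error */
    bool peers_reachable;      /* false covers exception 2: network between nodes down */
    bool all_peers_timed_out;  /* assumption: every peer reported the same timeout;
                                  a panicked peer (exception 3) cannot report */
};

/* Decide what a node does on a heartbeat write timeout. */
enum hb_action hb_decide(const struct hb_state *s)
{
    assert(s);

    if (!s->write_timed_out)
        return HB_CONTINUE;      /* healthy heartbeat, nothing to decide */
    if (s->io_error)
        return HB_FENCE_SELF;    /* fence as before */
    if (!s->peers_reachable)
        return HB_FENCE_SELF;    /* cannot negotiate, fence as before */
    if (!s->all_peers_timed_out)
        return HB_FENCE_SELF;    /* storage is not down for everyone */
    return HB_CONTINUE;          /* shared storage is down for all: keep going */
}
```

Under these assumptions, a node that is the only one timing out falls
into the all_peers_timed_out == false branch and fences itself.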

Junxiao Bi

2016-Jan-20 13:27 UTC

[Ocfs2-devel] ocfs2: o2hb: not fence self if storage down

Hi Joseph,
> On Jan 20, 2016, at 5:18 PM, Joseph Qi <joseph.qi at huawei.com> wrote:
> 
> Hi Junxiao,
> Thanks for the patch set.
> If only one node's storage link goes down and that node doesn't fence
> itself, the other nodes will still check and mark it dead, which will
> cause cluster membership inconsistency.
> In your patch set, I cannot see any logic to handle this. Am I missing
> something?

No, there is no logic for this. But why didn't the node fence itself
when storage went down? What would keep a softirq timer from running,
another bug?

Thanks,
Junxiao.

> On 2016/1/20 11:13, Junxiao Bi wrote:
>> Hi,
>> 
>> This series of patches fixes the issue that when storage goes down,
>> all nodes fence themselves due to the write timeout.
>> With this patch set, all nodes keep going until storage is back
>> online, unless one of the following happens, in which case all nodes
>> fence themselves as before:
>> 1. an I/O error is returned
>> 2. the network between nodes goes down
>> 3. a node panics
>> 
>> Junxiao Bi (6):
>>      ocfs2: o2hb: add negotiate timer
>>      ocfs2: o2hb: add NEGO_TIMEOUT message
>>      ocfs2: o2hb: add NEGOTIATE_APPROVE message
>>      ocfs2: o2hb: add some user/debug log
>>      ocfs2: o2hb: don't negotiate if last hb fail
>>      ocfs2: o2hb: fix hb hung time
>> 
>> fs/ocfs2/cluster/heartbeat.c |  181 ++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 175 insertions(+), 6 deletions(-)
>> 
>> Thanks,
>> Junxiao.
>> 
>> _______________________________________________
>> Ocfs2-devel mailing list
>> Ocfs2-devel at oss.oracle.com
>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>> 
>> 
> 
>
