thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability [Aug 2017]

Junxiao Bi

2017-Aug-23 02:23 UTC

[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

On 08/10/2017 06:49 PM, Changwei Ge wrote:
> Hi Joseph,
> 
> 
> On 2017/8/10 17:53, Joseph Qi wrote:
>> Hi Changwei,
>>
>> On 17/8/9 23:24, ge changwei wrote:
>>> Hi
>>>
>>>
>>> On 2017/8/9 ??7:32, Joseph Qi wrote:
>>>> Hi,
>>>>
>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>> Hi,
>>>>>
>>>>> In the current code, while flushing ASTs, we don't handle the case
>>>>> where sending an AST or BAST fails.
>>>>> But it is indeed possible for an AST or BAST to be lost due to some
>>>>> kind of network fault.
>>>>>
>>>> Could you please describe this issue more clearly? It is better to
>>>> analyze the issue along with the error message and the status of the
>>>> related nodes.
>>>> IMO, if the network is down, one of the two nodes will be fenced. So
>>>> what's your case here?
>>>>
>>>> Thanks,
>>>> Joseph
>>> I have posted the status of the related lock resource in my preceding
>>> email. Please check it out.
>>>
>>> Moreover, the network is not down forever, and not even for longer than
>>> the fencing threshold.
>>> So no node will be fenced.
>>>
>>> This issue happens in a terrible network environment. Some messages may
>>> be dropped by the switch due to various conditions.
>>> Even frequent and fast link up and down events will cause this issue.
>>>
>>> In a nutshell, re-queuing AST and BAST is crucial when the link between
>>> nodes recovers quickly. It prevents the cluster from hanging.
>> So you mean the tcp packet is lost due to connection reset? IIRC,
> Yes, it's something like that, an exception which I think deserves to be
> fixed within OCFS2.
>> Junxiao has posted a patchset to fix this issue.
>> If you go with re-queuing, how do you make sure the original
>> message is *truly* lost and the same ast/bast won't be sent twice?
> With regard to the TCP layer, if it returns an error to OCFS2, the packets
> cannot have been sent successfully. So no node will receive such an AST or
> BAST.

Right, but not only AST/BAST: other messages pending in the TCP queue will
also be lost if TCP returns an error to ocfs2, and this can also cause a hang.
Besides, your fix may introduce the duplicate ast/bast messages Joseph
mentioned.
Ocfs2 depends on TCP a lot; it can't work well if TCP returns an error to it.
To fix it, maybe ocfs2 should maintain its own message queue and ack
messages rather than depend on TCP.

Thanks,
Junxiao.
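
The idea of ocfs2 keeping its own message queue and acks can be illustrated with a small model. This is only a sketch in plain Python, not ocfs2 code: `AckedSender`, `FlakyLink`, and all method names are hypothetical, and the transport is assumed to be a callable that returns False when the send fails.

```python
from collections import OrderedDict


class AckedSender:
    """Keeps every message queued until the peer acknowledges it (hypothetical model)."""

    def __init__(self, transport):
        self.transport = transport    # callable(seq, payload) -> bool, True on success
        self.next_seq = 1
        self.unacked = OrderedDict()  # seq -> payload, in send order

    def send(self, payload):
        seq = self.next_seq
        self.next_seq += 1
        self.unacked[seq] = payload   # remember the message before trying to send it
        self.transport(seq, payload)  # a failure is tolerated: the entry stays queued
        return seq

    def on_ack(self, seq):
        self.unacked.pop(seq, None)   # peer confirmed receipt, so forget the message

    def resend_unacked(self):
        # Intended to run after the connection has been re-established.
        for seq, payload in list(self.unacked.items()):
            self.transport(seq, payload)


class FlakyLink:
    """Stand-in for the network: drops everything while `up` is False."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def __call__(self, seq, payload):
        if not self.up:
            return False
        self.delivered.append((seq, payload))
        return True


link = FlakyLink()
sender = AckedSender(link)
link.up = False                       # link flaps: both sends are lost
sender.send("AST")
sender.send("BAST")
lost_while_down = list(link.delivered)
link.up = True                        # link recovers before any node is fenced
sender.resend_unacked()
sender.on_ack(1)
```

In this model nothing is lost across the link flap, but a message whose ack was lost would be delivered twice, so the receiver would still need some way to recognize duplicates.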

> With regard to OCFS2, my patch guarantees that an AST/BAST can't be
> queued on the pending list twice with both copies sent successfully.
> 
> Thanks,
> Changwei
>>
>> Thanks,
>> Joseph
>>  
>>> Thanks,
>>> Changwei
>>>>> If the above exception happens, the requesting node will never get an
>>>>> AST back; hence, it will never acquire the lock or abort the current
>>>>> locking operation.
>>>>>
>>>>> With this patch, I'd like to fix this issue by re-queuing the AST or
>>>>> BAST if sending fails due to a network fault.
>>>>>
>>>>> And the re-queued AST or BAST will be dropped if the requesting node
>>>>> is dead!
>>>>>
>>>>> It will improve the reliability a lot.
>>>>>
>>>>>
>>>>> Thanks.
>>>>>
>>>>> Changwei.
>>>> _______________________________________________
>>>> Ocfs2-devel mailing list
>>>> Ocfs2-devel at oss.oracle.com
>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
>
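
The behaviour described in the patch summary quoted above (re-queue an AST or BAST when sending fails, drop it when the requesting node is dead) can be modelled roughly as follows. This is a toy in plain Python, not the ocfs2 DLM code and not the patch itself: `flush_pending`, `send`, and `is_node_alive` are hypothetical names, and a failed send is assumed to surface as a `ConnectionError`.

```python
from collections import deque


def flush_pending(pending, send, is_node_alive):
    """Run one flush pass over pending (node, msg) pairs; return what was re-queued."""
    requeue = deque()
    while pending:
        node, msg = pending.popleft()
        if not is_node_alive(node):
            continue                     # requester is dead: drop its AST/BAST
        try:
            send(node, msg)
        except ConnectionError:
            requeue.append((node, msg))  # transient fault: retry on the next pass
    pending.extend(requeue)
    return list(requeue)


# Demo: node 1 is reachable, node 2 hits a link flap, node 3 is dead.
pending = deque([(1, "AST"), (2, "BAST"), (3, "AST")])
alive = {1, 2}
sent = []


def send(node, msg):
    if node == 2:
        raise ConnectionError("link flap")
    sent.append((node, msg))


requeued = flush_pending(pending, send, lambda node: node in alive)
```

After the pass, only the message for node 2 remains on the pending list. The sketch does not address the duplicate-delivery concern raised in the thread, which arises when a send reports failure although the peer did receive the message.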

Joseph Qi

2017-Aug-23 03:34 UTC

[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

On 17/8/23 10:23, Junxiao Bi wrote:
> On 08/10/2017 06:49 PM, Changwei Ge wrote:
>> Hi Joseph,
>>
>>
>> On 2017/8/10 17:53, Joseph Qi wrote:
>>> Hi Changwei,
>>>
>>> On 17/8/9 23:24, ge changwei wrote:
>>>> Hi
>>>>
>>>>
>>>> On 2017/8/9 ??7:32, Joseph Qi wrote:
>>>>> Hi,
>>>>>
>>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>>> Hi,
>>>>>>
>>>>>> In the current code, while flushing ASTs, we don't handle the case
>>>>>> where sending an AST or BAST fails.
>>>>>> But it is indeed possible for an AST or BAST to be lost due to some
>>>>>> kind of network fault.
>>>>>>
>>>>> Could you please describe this issue more clearly? It is better to
>>>>> analyze the issue along with the error message and the status of the
>>>>> related nodes.
>>>>> IMO, if the network is down, one of the two nodes will be fenced. So
>>>>> what's your case here?
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>> I have posted the status of the related lock resource in my preceding
>>>> email. Please check it out.
>>>>
>>>> Moreover, the network is not down forever, and not even for longer than
>>>> the fencing threshold.
>>>> So no node will be fenced.
>>>>
>>>> This issue happens in a terrible network environment. Some messages may
>>>> be dropped by the switch due to various conditions.
>>>> Even frequent and fast link up and down events will cause this issue.
>>>>
>>>> In a nutshell, re-queuing AST and BAST is crucial when the link between
>>>> nodes recovers quickly. It prevents the cluster from hanging.
>>> So you mean the tcp packet is lost due to connection reset? IIRC,
>> Yes, it's something like that, an exception which I think deserves to be
>> fixed within OCFS2.
>>> Junxiao has posted a patchset to fix this issue.
>>> If you go with re-queuing, how do you make sure the original
>>> message is *truly* lost and the same ast/bast won't be sent twice?
>> With regard to the TCP layer, if it returns an error to OCFS2, the packets
>> cannot have been sent successfully. So no node will receive such an AST or
>> BAST.
> Right, but not only AST/BAST: other messages pending in the TCP queue will
> also be lost if TCP returns an error to ocfs2, and this can also cause a hang.
> Besides, your fix may introduce the duplicate ast/bast messages Joseph
> mentioned.
> Ocfs2 depends on TCP a lot; it can't work well if TCP returns an error to it.
> To fix it, maybe ocfs2 should maintain its own message queue and ack
> messages rather than depend on TCP.
>
Agree. Or we can add a sequence number to distinguish duplicate messages. With
that, we can simply resend a message if sending fails.

Thanks,
Joseph
> Thanks,
> Junxiao.
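
The sequence-number idea can be sketched on the receiving side as below. This is a simplified model in plain Python, not ocfs2 code: `DedupReceiver` and its fields are hypothetical, and it assumes each sender numbers its messages from 1 and that messages from one sender arrive in sequence order.

```python
class DedupReceiver:
    """Ignores a message whose sequence number was already handled (hypothetical model)."""

    def __init__(self):
        self.last_seq = {}  # sender node -> highest sequence number already handled
        self.handled = []   # messages that were actually processed

    def receive(self, node, seq, msg):
        if seq <= self.last_seq.get(node, 0):
            return False    # duplicate from a blind resend: ignore it
        self.last_seq[node] = seq
        self.handled.append((node, seq, msg))
        return True


rx = DedupReceiver()
results = [
    rx.receive(1, 1, "AST"),   # first delivery is processed
    rx.receive(1, 1, "AST"),   # resend of the same message is ignored
    rx.receive(1, 2, "BAST"),  # next message in sequence is processed
]
```

With a check like this in place, the sender can resend after any reported failure without knowing whether the original message actually arrived.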
