thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability [Sep 2017]

Changwei Ge

2017-Sep-13 07:03 UTC

[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability

Hi,

I don't think the duplicated AST issue mentioned earlier actually exists,
because a re-sent AST won't find any lock on the converting list or the
blocked list.
So how could the AST callback be called twice?

Thanks,
Changwei
> 
> On 2017/8/23 12:48, Gang He wrote:
>>
>>
>>> On 17/8/23 10:23, Junxiao Bi wrote:
>>>> On 08/10/2017 06:49 PM, Changwei Ge wrote:
>>>>> Hi Joseph,
>>>>>
>>>>>
>>>>> On 2017/8/10 17:53, Joseph Qi wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 17/8/9 23:24, ge changwei wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>>
>>>>>>> On 2017/8/9 ??7:32, Joseph Qi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> In the current code, while flushing ASTs, we don't handle the case
>>>>>>>>> where sending an AST or BAST fails.
>>>>>>>>> But it is indeed possible for an AST or BAST to be lost due to some
>>>>>>>>> kind of network fault.
>>>>>>>>>
>>>>>>>> Could you please describe this issue more clearly? It is better to
>>>>>>>> analyze the issue along with the error message and the status of the
>>>>>>>> related nodes.
>>>>>>>> IMO, if the network is down, one of the two nodes will be fenced. So
>>>>>>>> what's your case here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joseph
>>>>>>> I have posted the status of the related lock resource in my preceding email.
>>>>>>> Please check it out.
>>>>>>>
>>>>>>> Moreover, the network is not down forever; the outage does not even
>>>>>>> last longer than the fencing threshold.
>>>>>>> So no node will be fenced.
>>>>>>>
>>>>>>> This issue happens in a very poor network environment. Some messages
>>>>>>> may be dropped by the switch under various conditions.
>>>>>>> Frequent, rapid link up and down events will also cause this issue.
>>>>>>>
>>>>>>> In a nutshell, re-queuing the AST and BAST is crucial when the link
>>>>>>> between nodes recovers quickly. It prevents the cluster from hanging.
>>>>>>> So you mean the TCP packet is lost due to a connection reset? IIRC,
>>>>> Yes, it's something like that, an exception which I think deserves
>>>>> to be fixed within OCFS2.
>>>>>> Junxiao has posted a patchset to fix this issue.
>>>>>> If you take the re-queuing approach, how do you make sure the
>>>>>> original message is *truly* lost and the same AST/BAST won't be sent twice?
>>>>> With regard to the TCP layer, if it returns an error to OCFS2, the packets
>>>>> cannot have been sent successfully. So no node will receive such
>>>>> an AST or BAST.
>>>> Right, but it is not only AST/BAST: other messages pending in the TCP queue
>>>> will also be lost if TCP returns an error to ocfs2, and this can also cause a hang.
>>>> Besides, your fix may introduce the duplicated AST/BAST messages Joseph
>>>> mentioned.
>>>> Ocfs2 depends on TCP a lot; it can't work well if TCP returns an error to it.
>>>> To fix it, maybe ocfs2 should maintain its own message queue and ack
>>>> messages rather than depend on TCP.
>>> Agreed. Or we can add a sequence number to distinguish duplicate messages.
>>> With that, we can simply resend a message if sending fails.
>> It looks like we need to make the messages stateless.
>> Maybe we can look at GFS2 to see whether it has considered this issue.
>>
>> Thanks
>> Gang
> Um.
> Since Joseph, Junxiao and Gang all have different or opposing opinions on
> this hang fix, I will perform more tests to check whether the previously
> mentioned duplicated AST issue truly exists. If it does, I will try to
> figure out a new way to fix it and send an improved version of this patch.
> 
> I will report the test results in a few days. Anyway, thanks for your
> comments.
> 
> Thanks,
> Changwei.
>>> Thanks,
>>> Joseph
>>>   
>>>> Thanks,
>>>> Junxiao.
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> 
>
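
For reference, the sequence-number idea raised in the quoted thread (tag each message so that a resend can be recognized as a duplicate) could look roughly like the sketch below. This is only an illustration in Python, not ocfs2 code: Message, Sender, Receiver, the transport callable and the lock name are all hypothetical names made up for this example, and a real implementation would also need to bound the set of remembered sequence numbers.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int  # hypothetical node number of the sending node
    seq: int     # per-sender sequence number, assigned once per message
    kind: str    # "AST" or "BAST" in this sketch
    lock: str    # hypothetical lock resource name


@dataclass
class Receiver:
    # (sender, seq) pairs already processed; unbounded here for simplicity.
    seen: set = field(default_factory=set)
    delivered: list = field(default_factory=list)

    def handle(self, msg: Message) -> bool:
        """Process msg at most once; return False for a duplicate."""
        key = (msg.sender, msg.seq)
        if key in self.seen:
            return False  # a resend of something already handled
        self.seen.add(key)
        self.delivered.append(msg)
        return True


class Sender:
    def __init__(self, node: int, transport):
        self.node = node
        # transport is an assumed callable(msg) -> bool, True on success.
        self.transport = transport
        self.next_seq = 1

    def send(self, kind: str, lock: str, max_tries: int = 3) -> bool:
        # The sequence number is fixed before the first attempt, so every
        # retry carries the same (sender, seq) pair.
        msg = Message(self.node, self.next_seq, kind, lock)
        self.next_seq += 1
        for _ in range(max_tries):
            if self.transport(msg):
                return True
        return False
```

With this shape, the ambiguous case discussed above (the send reports an error although the peer did receive the message) is harmless: the retry reaches the receiver with the same sequence number and is dropped, so the callback would run only once.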
