Changwei Ge
2017-Sep-13 07:03 UTC
[Ocfs2-devel] [PATCH] ocfs2: re-queue AST or BAST if sending is failed to improve the reliability
Hi,

I think the duplicated AST issue mentioned earlier doesn't actually exist, because the re-sent AST won't find any lock on the converting list or the blocked list. So how could the AST callback be called twice?

Thanks,
Changwei

>
> On 2017/8/23 12:48, Gang He wrote:
>>
>>
>>> On 17/8/23 10:23, Junxiao Bi wrote:
>>>> On 08/10/2017 06:49 PM, Changwei Ge wrote:
>>>>> Hi Joseph,
>>>>>
>>>>>
>>>>> On 2017/8/10 17:53, Joseph Qi wrote:
>>>>>> Hi Changwei,
>>>>>>
>>>>>> On 17/8/9 23:24, ge changwei wrote:
>>>>>>> Hi
>>>>>>>
>>>>>>>
>>>>>>> On 2017/8/9 ??7:32, Joseph Qi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 17/8/7 15:13, Changwei Ge wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> In current code, while flushing AST, we don't handle an
>>>>>>>>> exception that sending AST or BAST is failed.
>>>>>>>>> But it is indeed possible that AST or BAST is lost due to some
>>>>>>>>> kind of networks fault.
>>>>>>>>>
>>>>>>>> Could you please describe this issue more clearly? It is better
>>>>>>>> analyze issue along with the error message and the status of related nodes.
>>>>>>>> IMO, if network is down, one of the two nodes will be fenced. So
>>>>>>>> what's your case here?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Joseph
>>>>>>> I have posted the status of related lock resource in my preceding email.
>>>>>>> Please check them out.
>>>>>>>
>>>>>>> Moreover, network is not down forever even not longer than
>>>>>>> threshold to be fenced.
>>>>>>> So no node will be fenced.
>>>>>>>
>>>>>>> This issue happens in terrible network environment. Some messages
>>>>>>> may be abandoned by switch due to various conditions.
>>>>>>> And even frequent and fast link up and down will also cause this issue.
>>>>>>>
>>>>>>> In a nutshell, re-queuing AST and BAST is crucial when link
>>>>>>> between nodes recover quickly. It prevents cluster from hanging.
>>>>>>>
>>>>>> So you mean the tcp packet is lost due to connection reset? IIRC,
>>>>> Yes, it's something like that exception which I think is deserved
>>>>> to be fixed within OCFS2.
>>>>>> Junxiao has posted a patchset to fix this issue.
>>>>>> If you are using the way of re-queuing, how to make sure the
>>>>>> original message is *truly* lost and the same ast/bast won't be sent twice?
>>>>> With regards to TCP layer, if it returns error to OCFS2, packets
>>>>> must not be sent successfully. So no node will obtain such an AST or BAST.
>>>> Right, but not only AST/BAST, other messages pending in tcp queue
>>>> will also lost if tcp return error to ocfs2, this can also caused hung.
>>>> Besides, your fix may introduce duplicated ast/bast message Joseph
>>>> mentioned.
>>>> Ocfs2 depends tcp a lot, it can't work well if tcp return error to it.
>>>> To fix it, maybe ocfs2 should maintain its own message queue and ack
>>>> messages while not depend on TCP.
>>> Agree. Or we can add a sequence to distinguish duplicate message.
>>> Under this, we can simply resend message if fails.
>> Look likes, we need to make the message stateless.
>> Maybe, we can refer to GFS2, to see if GFS2 has considered this issue.
>>
>> Thanks
>> Gang
> Um.
> Since Joseph, Junxiao and Gang all have a different or opposite opinion on this hang issue fix, I will perform more tests to check if the previously mentioned duplicated ast issue truly exists. And if it does exist, I will try to figure out a new way to fix it and send a improved version of this patch.
>
> I will report the test results few days later. Anyway, thanks for your comments.
>
> Thank,
> Changwei.
>>> Thanks,
>>> Joseph
>>>
>>>> Thanks,
>>>> Junxiao.
>>> _______________________________________________
>>> Ocfs2-devel mailing list
>>> Ocfs2-devel at oss.oracle.com
>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
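
For reference, the sequence-number idea raised in the quoted thread (tag each message so that a receiver can recognize a resend) could look roughly like the sketch below. This is a user-space Python illustration only, not OCFS2 or o2net code. The names `Sender`, `Receiver`, `flaky_send` and the message strings are all hypothetical. It also assumes in-order delivery per sender, which is why `flush` stops at the first failure instead of skipping ahead.

```python
from collections import deque


class Receiver:
    """Hypothetical receiver: hands each (sender, seq) to the handler at most once."""

    def __init__(self, handler):
        self.handler = handler
        self.last_seq = {}  # sender id -> highest sequence number delivered

    def receive(self, sender_id, seq, payload):
        # A sequence number at or below the last delivered one is a resend.
        if seq <= self.last_seq.get(sender_id, 0):
            return False
        self.last_seq[sender_id] = seq
        self.handler(payload)
        return True


class Sender:
    """Hypothetical sender: keeps a message queued until its send succeeds."""

    def __init__(self, sender_id, transport):
        self.sender_id = sender_id
        # transport(sender_id, seq, payload) is assumed to raise OSError on failure.
        self.transport = transport
        self.next_seq = 1
        self.pending = deque()

    def queue(self, payload):
        self.pending.append((self.next_seq, payload))
        self.next_seq += 1

    def flush(self):
        """Send pending messages in order; stop at the first failure."""
        while self.pending:
            seq, payload = self.pending[0]
            try:
                self.transport(self.sender_id, seq, payload)
            except OSError:
                return False  # this message and all later ones stay queued
            self.pending.popleft()
        return True


# Simulate the worrying case: the message arrives, but the sender sees an error.
delivered = []
receiver = Receiver(delivered.append)
attempts = {"count": 0}


def flaky_send(sender_id, seq, payload):
    attempts["count"] += 1
    receiver.receive(sender_id, seq, payload)
    if attempts["count"] == 1:
        raise OSError("connection reset after delivery")


sender = Sender("node1", flaky_send)
sender.queue("AST for lock A")
sender.queue("BAST for lock B")
first = sender.flush()   # AST is delivered, but the send reports failure
second = sender.flush()  # the resent AST is dropped; the BAST goes through
print(delivered)
```

In this sketch the resend is harmless because the receiver filters it. Whether real AST handling is already idempotent without such a filter (the re-sent AST finding no lock on the converting or blocked list) is the open question of this thread and is not something the sketch settles.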