akpm at linux-foundation.org
2017-Nov-30 22:24 UTC
[Ocfs2-devel] [patch 07/11] ocfs2: fix qs_holds may could not be zero
From: Zhangyang <zhang.yangB at h3c.com>
Subject: ocfs2: fix qs_holds may could not be zero

In our tests, we found that when the network goes down, qs->qs_holds
may never be reduced to zero, which prevents the node from fencing.

o2net_idle_timer -> o2quo_conn_err -> qs->qs_holds++. After
O2NET_QUORUM_DELAY_MS, if qs_holds can be decremented to zero, the node
can run o2quo_make_decision.

However, when there are many nodes and the network of one node goes
down, its o2net connections may not all run o2net_idle_timer at the same
time. So when one o2net_node has run nn->nn_still_up, qs_holds is still
not zero, because the other o2net_nodes have not yet run
nn->nn_still_up. The first o2net_node then runs o2net_idle_timer again,
and qs_holds is incremented again. Because qs_holds is a global
variable, this forms a loop: qs_holds never reaches zero, so the node
can never run o2quo_make_decision.

I altered the function o2quo_conn_err so that o2quo_set_hold is called
only under the control of the bitmap qs_conn_bm.

Link: http://lkml.kernel.org/r/7F50894FD17BEC45AAC26E5BADA6CE330C60F99A@H3CMLB12-EX.srv.huawei-3com.com
Signed-off-by: Yang Zhang <zhang.yangB at h3c.com>
Cc: Mark Fasheh <mfasheh at versity.com>
Cc: Joel Becker <jlbec at evilplan.org>
Cc: Junxiao Bi <junxiao.bi at oracle.com>
Cc: Joseph Qi <jiangqi903 at gmail.com>
Signed-off-by: Andrew Morton <akpm at linux-foundation.org>
---

 fs/ocfs2/cluster/quorum.c |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff -puN fs/ocfs2/cluster/quorum.c~ocfs2-fix-qs_holds-may-could-not-be-zero fs/ocfs2/cluster/quorum.c
--- a/fs/ocfs2/cluster/quorum.c~ocfs2-fix-qs_holds-may-could-not-be-zero
+++ a/fs/ocfs2/cluster/quorum.c
@@ -314,13 +314,16 @@ void o2quo_conn_err(u8 node)
 				node, qs->qs_connected);
 
 		clear_bit(node, qs->qs_conn_bm);
+		/*
+		 * Bring set hold within this judgement, in order to avoid
+		 * qs_hold could not be zero.
+		 */
+		if (test_bit(node, qs->qs_hb_bm))
+			o2quo_set_hold(qs, node);
 	}
 
 	mlog(0, "node %u, %d total\n", node, qs->qs_connected);
 
-	if (test_bit(node, qs->qs_hb_bm))
-		o2quo_set_hold(qs, node);
-
 	spin_unlock(&qs->qs_lock);
 }
 
_
Changwei Ge
2017-Dec-01 01:40 UTC
[Ocfs2-devel] [PATCH resend] renew changelog and title Re: [patch 07/11] ocfs2: fix qs_holds may could not be zero
Hi Andrew,

I helped Yang clean up his changelog and am re-sending his patch with a
reworked title. The author is still Yang Zhang.

Thanks,
Changwei

Subject: [PATCH] ocfs2/cluster: close a race that fence can't be triggered

When some nodes of a cluster encounter TCP connection faults, ocfs2
picks a quorum to continue working, and the other nodes are fenced by
resetting the host.

To decide which node should be fenced, ocfs2 relies on
o2quo_state::qs_holds. When that variable drops to zero, ocfs2 tries to
decide whether the local node should be fenced.

However, in the specific scenario where the local node is not
disconnected from all the others at the same time, this method can fail
to reduce ::qs_holds to zero, because the o2net 90s idle timers
corresponding to different nodes fire one after another.

    node 2                              node 3
    90s idle timer elapses
    clear ::qs_conn_bm
    set hold
                                        40s pass
                                        90s idle timer elapses
                                        clear ::qs_conn_bm
                                        set hold
    still up timer elapses
    clear hold (NOT to zero)
    90s idle timer elapses AGAIN
                                        still up timer elapses
                                        clear hold
    still up timer elapses

To solve this issue, a node that has already been evicted from
::qs_conn_bm must not set a hold again and again each time the idle
timer fires.

Signed-off-by: Yang Zhang <zhang.yangB at h3c.com>
Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
---
 fs/ocfs2/cluster/quorum.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/cluster/quorum.c b/fs/ocfs2/cluster/quorum.c
index 62e8ec619b4c..af2e7473956e 100644
--- a/fs/ocfs2/cluster/quorum.c
+++ b/fs/ocfs2/cluster/quorum.c
@@ -314,12 +314,13 @@ void o2quo_conn_err(u8 node)
 				node, qs->qs_connected);
 
 		clear_bit(node, qs->qs_conn_bm);
+
+		if (test_bit(node, qs->qs_hb_bm))
+			o2quo_set_hold(qs, node);
 	}
 
 	mlog(0, "node %u, %d total\n", node, qs->qs_connected);
 
-	if (test_bit(node, qs->qs_hb_bm))
-		o2quo_set_hold(qs, node);
 
 	spin_unlock(&qs->qs_lock);
 }
-- 
2.11.0
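
The timeline in the changelog can be replayed with a small userspace model. This is only a sketch under stated assumptions, not kernel code: struct quorum_model, model_conn_err(), model_clear_hold() and simulate() are hypothetical names, the 60s still-up delay is an assumed value (any delay between 50s and 90s gives the same interleaving), each peer is assumed to hold at most one hold at a time, and both peers are assumed to keep heartbeating so the qs_hb_bm test always passes. The 90s idle period and the 40s stagger are taken from the changelog.

```c
#include <stdbool.h>

#define NODES            2   /* the two peers, "node 2" and "node 3" in the timeline */
#define IDLE_PERIOD_S    90  /* idle timer period, from the changelog */
#define STILL_UP_DELAY_S 60  /* ASSUMED delay before the still-up timer fires */
#define STAGGER_S        40  /* second peer's idle timer lags by 40s, from the changelog */

struct quorum_model {
	bool connected[NODES];  /* stands in for qs_conn_bm */
	bool holding[NODES];    /* ASSUMED: at most one hold per peer */
	int holds;              /* stands in for qs_holds */
};

/* Model of the connection-error path; peers are assumed to be heartbeating. */
static void model_conn_err(struct quorum_model *qs, int node, bool patched)
{
	bool was_connected = qs->connected[node];

	qs->connected[node] = false;

	/* Patched behaviour: only a peer that was still connected takes a hold. */
	if (patched && !was_connected)
		return;

	if (!qs->holding[node]) {
		qs->holding[node] = true;
		qs->holds++;
	}
}

/* Model of the still-up timer dropping the hold for one peer. */
static void model_clear_hold(struct quorum_model *qs, int node)
{
	if (qs->holding[node]) {
		qs->holding[node] = false;
		qs->holds--;
	}
}

/*
 * Replay the timers second by second. Returns the first second at which
 * holds drops to zero (the point where a fence decision could be made),
 * or -1 if that never happens within horizon_s seconds.
 */
int simulate(bool patched, int horizon_s)
{
	struct quorum_model qs = { .connected = { true, true } };

	for (int t = 0; t <= horizon_s; t++) {
		for (int node = 0; node < NODES; node++) {
			int since_first = t - node * STAGGER_S;

			if (since_first < 0)
				continue;

			int phase = since_first % IDLE_PERIOD_S;

			if (phase == 0) {
				/* idle timer fires (again) for this peer */
				model_conn_err(&qs, node, patched);
			} else if (phase == STILL_UP_DELAY_S) {
				/* still-up timer fires for this peer */
				model_clear_hold(&qs, node);
				if (qs.holds == 0)
					return t;
			}
		}
	}
	return -1;
}
```

Under these assumptions, simulate(false, 600) returns -1, because one peer always re-takes its hold before the other releases, while simulate(true, 600) returns 100, the moment the second peer's still-up timer fires and the count reaches zero.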