Zhangyang
2017-Sep-21 02:09 UTC
[Ocfs2-devel] [patch] ocfs2: fix qs_holds may could not be zero
Hi all,

In our test, we found that when the network goes down, qs->qs_holds may never be reduced to zero, so the node cannot fence.

The path is o2net_idle_timer -> o2quo_conn_err -> qs->qs_holds++. After O2NET_QUORUM_DELAY_MS, if qs_holds can be decremented to zero, the node can run o2quo_make_decision.

But when there are many nodes and the network of one node goes down, its o2net connections may not all run o2net_idle_timer at the same time. So when one o2net_node has run nn->nn_still_up, qs_holds is still not zero, because the other o2net_nodes have not run nn->nn_still_up yet.

The first o2net_node then runs o2net_idle_timer again, and qs_holds can be incremented again. Because qs_holds is a global variable, this forms a loop: qs_holds never reaches zero, so the node can never run o2quo_make_decision.

I changed o2quo_conn_err so that o2quo_set_hold is called only under control of the bitmap qs_conn_bm.

Signed-off-by: Yang Zhang <zhang.yangB at h3c.com>
---
 fs/ocfs2/cluster/quorum.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/ocfs2/cluster/quorum.c b/fs/ocfs2/cluster/quorum.c
index 3f337e5..0fe531e 100644
--- a/fs/ocfs2/cluster/quorum.c
+++ b/fs/ocfs2/cluster/quorum.c
@@ -423,13 +423,15 @@ void o2quo_conn_err(u8 node)
 				node, qs->qs_connected);
 
 		clear_bit(node, qs->qs_conn_bm);
+		/*bring set hold within this judgement, in order to avoid qs_hold
+		 * could not be zero.
+		 */
+		if (test_bit(node, qs->qs_hb_bm))
+			o2quo_set_hold(qs, node);
 	}
 
 	mlog(0, "node %u, %d total\n", node, qs->qs_connected);
 
-	if (test_bit(node, qs->qs_hb_bm))
-		o2quo_set_hold(qs, node);
-
 	spin_unlock(&qs->qs_lock);
 }
Andrew Morton
2017-Oct-17 23:20 UTC
[Ocfs2-devel] [patch] ocfs2: fix qs_holds may could not be zero
On Thu, 21 Sep 2017 02:09:33 +0000 Zhangyang <zhang.yangB at h3c.com> wrote:

> In our test, we found that when the network goes down, qs->qs_holds may never be reduced to zero, so the node cannot fence.
>
> The path is o2net_idle_timer -> o2quo_conn_err -> qs->qs_holds++. After O2NET_QUORUM_DELAY_MS, if qs_holds can be decremented to zero, the node can run o2quo_make_decision.
>
> But when there are many nodes and the network of one node goes down, its o2net connections may not all run o2net_idle_timer at the same time.
>
> So when one o2net_node has run nn->nn_still_up, qs_holds is still not zero, because the other o2net_nodes have not run nn->nn_still_up yet.
>
> The first o2net_node then runs o2net_idle_timer again, and qs_holds can be incremented again. Because qs_holds is a global variable, this forms a loop: qs_holds never reaches zero, so the node can never run o2quo_make_decision.
>
> I changed o2quo_conn_err so that o2quo_set_hold is called only under control of the bitmap qs_conn_bm.

I merged this, subject to review by the ocfs2 maintainers.

The changelog and the comment are really hard to understand. Perhaps one of the ocfs2 developers could suggest some more clear words to use?

Thanks.
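
The interleaving described in the changelog may be easier to follow as a small user-space model. The sketch below is not kernel code. The names quorum_model, model_set_hold, model_still_up, model_conn_err and holds_reach_zero are hypothetical, the bitmaps are plain integers, and it assumes only what the changelog states: a connection error takes a hold for the node, and the delayed nn_still_up step drops it. The patched flag switches between taking the hold on every connection error (old behaviour) and taking it only when the node's bit in the connection bitmap was still set (behaviour after the patch).

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified user-space model; all names here are hypothetical. */
struct quorum_model {
	uint32_t conn_bm;  /* nodes still considered connected */
	uint32_t hb_bm;    /* nodes still heartbeating */
	uint32_t hold_bm;  /* nodes currently holding off the decision */
	int holds;         /* number of bits set in hold_bm */
};

/* Take a hold for this node unless it already has one. */
static void model_set_hold(struct quorum_model *q, unsigned node)
{
	if (!(q->hold_bm & (1u << node))) {
		q->hold_bm |= 1u << node;
		q->holds++;
	}
}

/* Models the delayed "still up" step: drop the hold for this node. */
static void model_still_up(struct quorum_model *q, unsigned node)
{
	if (q->hold_bm & (1u << node)) {
		q->hold_bm &= ~(1u << node);
		q->holds--;
	}
}

/* Models a connection error reported by the idle timer. */
static void model_conn_err(struct quorum_model *q, unsigned node, bool patched)
{
	bool was_connected = (q->conn_bm & (1u << node)) != 0;

	if (was_connected)
		q->conn_bm &= ~(1u << node);

	/* Unpatched: hold on every error. Patched: only on the first one. */
	if ((!patched || was_connected) && (q->hb_bm & (1u << node)))
		model_set_hold(q, node);
}

/* Returns true if holds drops to zero within the given number of rounds. */
static bool holds_reach_zero(bool patched, int rounds)
{
	struct quorum_model q = { .conn_bm = 0x3, .hb_bm = 0x3 };

	model_conn_err(&q, 0, patched);          /* node 0 times out first */
	for (int i = 0; i < rounds; i++) {
		model_conn_err(&q, 1, patched);  /* node 1 times out later */
		model_still_up(&q, 0);
		if (q.holds == 0)
			return true;
		model_conn_err(&q, 0, patched);  /* node 0 times out again */
		model_still_up(&q, 1);
		if (q.holds == 0)
			return true;
	}
	return false;
}
```

With two nodes whose timers are offset in this way, holds_reach_zero(false, n) stays false for any n because each release is preceded by a fresh hold from the other node, while holds_reach_zero(true, n) becomes true in the first round. The model ignores locking and the real timer delays.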