Jiaju Zhang
2011-Sep-01 15:28 UTC
[Ocfs2-devel] [PATCH] ocfs2: handle ocfs2 node down event more correctly
When ocfs2 is used with the in-kernel fs/dlm and a user-space cluster
stack, osb->node_num == node_num in ocfs2_do_node_down() no longer
necessarily indicates a bug. ocfs2_controld may receive the node down
information first. In the normal case, dlm_controld receives the same
node down information soon afterwards, and then
osb->node_num != node_num. In a rare case, however, the node comes up
again before dlm_controld receives the node down information, so
dlm_controld never receives it, which results in
osb->node_num == node_num here. This case can happen and should not be
treated as a bug. Simply returning here, without triggering the
recovery thread, should be the right way to go. It also introduces no
other side effects when the o2cb stack is used.

Signed-off-by: Jiaju Zhang <jjzhang at suse.de>
---
 fs/ocfs2/heartbeat.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/heartbeat.c b/fs/ocfs2/heartbeat.c
index d8208b2..632e855 100644
--- a/fs/ocfs2/heartbeat.c
+++ b/fs/ocfs2/heartbeat.c
@@ -64,10 +64,11 @@ void ocfs2_do_node_down(int node_num, void *data)
 {
 	struct ocfs2_super *osb = data;
 
-	BUG_ON(osb->node_num == node_num);
-
 	trace_ocfs2_do_node_down(node_num);
 
+	if (osb->node_num == node_num)
+		return;
+
 	if (!osb->cconn) {
 		/*
 		 * No cluster connection means we're not even ready to
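
To make the behavioural change concrete, here is a minimal user-space
sketch of the check the patch introduces. It is not the kernel code:
struct mock_super, mock_do_node_down() and the recoveries_queued
counter are hypothetical stand-ins invented for illustration, and the
counter only models "the recovery thread would have been triggered".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ocfs2_super (illustration only). */
struct mock_super {
	int node_num;          /* this node's own number */
	int recoveries_queued; /* how often recovery would have been triggered */
};

/*
 * Models the patched check: a node down event that names the local
 * node is ignored instead of hitting BUG_ON().
 * Returns true if recovery would have been triggered.
 */
static bool mock_do_node_down(int node_num, void *data)
{
	struct mock_super *osb = data;

	assert(osb != NULL);

	/* Stale event for ourselves: return without starting recovery. */
	if (osb->node_num == node_num)
		return false;

	/* Any other node: count a recovery request (assumed behaviour). */
	osb->recoveries_queued++;
	return true;
}
```

With the original BUG_ON(), the first case would have crashed the
kernel; in this sketch it is a no-op, while a down event for any other
node still leads to recovery.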
Jiaju Zhang
2011-Sep-02 08:57 UTC
[Ocfs2-devel] [PATCH] ocfs2: handle ocfs2 node down event more correctly
I just found out that this patch may not be correct, since it also
needs some changes in user space. I'll look into the issue more closely
to see whether it can be resolved entirely in user space. So please
ignore this patch; sorry for the noise ;)

Thanks,
Jiaju

On Thu, Sep 1, 2011 at 11:28 PM, Jiaju Zhang <jjzhang.linux at gmail.com> wrote:
> In the scenario that ocfs2 is used with in-kernel fs/dlm and user-space
> cluster stack, osb->node_num == node_num in ocfs2_do_node_down doesn't
> mean it is a bug any more. This is because ocfs2_controld might receive
> the node down information first, in the normal case, dlm_controld should
> receive that node down information soon then osb->node_num != node_num.
> But a rare case is before dlm_controld receive the node down information,
> that node is up again and dlm_controld won't receive node down any more,
> which results in osb->node_num == node_num here, this case can happen and
> it should not be a bug. Just return here and won't trigger the recovery
> thread should be the right way to go. Also, it won't introduce other side
> effect when using o2cb stack.
>
> Signed-off-by: Jiaju Zhang <jjzhang at suse.de>
> ---
>  fs/ocfs2/heartbeat.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ocfs2/heartbeat.c b/fs/ocfs2/heartbeat.c
> index d8208b2..632e855 100644
> --- a/fs/ocfs2/heartbeat.c
> +++ b/fs/ocfs2/heartbeat.c
> @@ -64,10 +64,11 @@ void ocfs2_do_node_down(int node_num, void *data)
>  {
>         struct ocfs2_super *osb = data;
>
> -       BUG_ON(osb->node_num == node_num);
> -
>         trace_ocfs2_do_node_down(node_num);
>
> +       if (osb->node_num == node_num)
> +               return;
> +
>         if (!osb->cconn) {
>                 /*
>                  * No cluster connection means we're not even ready to
>