Xue jiufei
2013-Jun-18 03:13 UTC
[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2
> From: "Xiaowei.Hu" <xiaowei.hu at oracle.com> > > when the master requested locks ,but one/some of the live nodes died, > after it received the request msg and before send out the locks packages, > the recovery will fall into endless loop,waiting for the status changed to finalize > > NodeA NodeB > selected as recovery master > dlm_remaster_locks > -> dlm_requeset_all_locks > this send request locks msg to B > received the msg from A, > queue worker dlm_request_all_locks_worker > return 0 > go on set state to requested > wait for the state become done > NodeB lost connection due to network > before the worker begin, or it die. > > NodeA still waiting for the change of reco state. > It won't end if it not get data done msg. > And at this time nodeB do not realize this (or it just died), > it won't send the msg for ever, nodeA left in the recovery process forever. > > This patch let the recovery master check if the node still in live node > map when it stay in REQUESTED status. >Hi, xiaowei, We have reviewed this patch and have some questions: 1) in dlm_is_node_in_livemap(), I think it should use !!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is live; 2) why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap() in dlm_remaster_locks? I think dlm_is_node_dead() is better because dlm->live_nodes_map may be remain when another node umount. Thanks.> Signed-off-by: Xiaowei.Hu <xiaowei.hu at oracle.com> > --- > fs/ocfs2/dlm/dlmrecovery.c | 16 +++++++++++++++- > 1 files changed, 15 insertions(+), 1 deletions(-) > > diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c > index 01ebfd0..546c5b5 100644 > --- a/fs/ocfs2/dlm/dlmrecovery.c > +++ b/fs/ocfs2/dlm/dlmrecovery.c > @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm) > return ready; > } > > +/* returns true if node is still in the live node map > + * this map is cleared before domain map,could be checked in recovery*/ > +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node) > +{ > + int live; > + spin_lock(&dlm->spinlock); > + live = !test_bit(node, dlm->live_nodes_map); > + spin_unlock(&dlm->spinlock); > + return live; > +} > + > /* returns true if node is no longer in the domain > * could be dead or just not joined */ > int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node) > @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) > dlm->name, ndata->node_num, > ndata->state==DLM_RECO_NODE_DATA_RECEIVING ? > "receiving" : "requested"); > - all_nodes_done = 0; > + if (!dlm_is_node_in_livemap(dlm, ndata->node_num)) > + ndata->state = DLM_RECO_NODE_DATA_DEAD; > + else > + all_nodes_done = 0; > break; > case DLM_RECO_NODE_DATA_DONE: > mlog(0, "%s: node %u state is done\n", >.
Jeff Liu
2013-Jun-23 11:12 UTC
[Ocfs2-devel] [PATCH] Fix waiting status race condition in dlm recovery V2
On 06/18/2013 11:13 AM, Xue jiufei wrote:>> From: "Xiaowei.Hu" <xiaowei.hu at oracle.com> >> >> when the master requested locks ,but one/some of the live nodes died, >> after it received the request msg and before send out the locks packages, >> the recovery will fall into endless loop,waiting for the status changed to finalize >> >> NodeA NodeB >> selected as recovery master >> dlm_remaster_locks >> -> dlm_requeset_all_locks >> this send request locks msg to B >> received the msg from A, >> queue worker dlm_request_all_locks_worker >> return 0 >> go on set state to requested >> wait for the state become done >> NodeB lost connection due to network >> before the worker begin, or it die. >> >> NodeA still waiting for the change of reco state. >> It won't end if it not get data done msg. >> And at this time nodeB do not realize this (or it just died), >> it won't send the msg for ever, nodeA left in the recovery process forever. >> >> This patch let the recovery master check if the node still in live node >> map when it stay in REQUESTED status. >> > > Hi, xiaowei, > We have reviewed this patch and have some questions: > 1) in dlm_is_node_in_livemap(), I think it should use > !!(test_bit(node, dlm->live_nodes_map)) to determine whether a node is > live;Hmm? test_bit(node,...) is ok.> 2) why not use dlm_is_node_dead() instead of dlm_is_node_in_livemap() > in dlm_remaster_locks? > I think dlm_is_node_dead() is better because dlm->live_nodes_map > may be remain when another node umount.Please refer to: http://marc.info/?l=ocfs2-devel&m=133799814717270&w=2 Thanks, -Jeff> > Thanks. > >> Signed-off-by: Xiaowei.Hu <xiaowei.hu at oracle.com> >> --- >> fs/ocfs2/dlm/dlmrecovery.c | 16 +++++++++++++++- >> 1 files changed, 15 insertions(+), 1 deletions(-) >> >> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c >> index 01ebfd0..546c5b5 100644 >> --- a/fs/ocfs2/dlm/dlmrecovery.c >> +++ b/fs/ocfs2/dlm/dlmrecovery.c >> @@ -339,6 +339,17 @@ static int dlm_reco_master_ready(struct dlm_ctxt *dlm) >> return ready; >> } >> >> +/* returns true if node is still in the live node map >> + * this map is cleared before domain map,could be checked in recovery*/ >> +int dlm_is_node_in_livemap(struct dlm_ctxt *dlm, u8 node) >> +{ >> + int live; >> + spin_lock(&dlm->spinlock); >> + live = !test_bit(node, dlm->live_nodes_map); >> + spin_unlock(&dlm->spinlock); >> + return live; >> +} >> + >> /* returns true if node is no longer in the domain >> * could be dead or just not joined */ >> int dlm_is_node_dead(struct dlm_ctxt *dlm, u8 node) >> @@ -679,7 +690,10 @@ static int dlm_remaster_locks(struct dlm_ctxt *dlm, u8 dead_node) >> dlm->name, ndata->node_num, >> ndata->state==DLM_RECO_NODE_DATA_RECEIVING ? >> "receiving" : "requested"); >> - all_nodes_done = 0; >> + if (!dlm_is_node_in_livemap(dlm, ndata->node_num)) >> + ndata->state = DLM_RECO_NODE_DATA_DEAD; >> + else >> + all_nodes_done = 0; >> break; >> case DLM_RECO_NODE_DATA_DONE: >> mlog(0, "%s: node %u state is done\n", >> > > > > > . > > > > > > _______________________________________________ > Ocfs2-devel mailing list > Ocfs2-devel at oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel