Changwei Ge
2017-Oct-17 06:48 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
When a node dies, other live nodes have to choose a new master
for an existing lock resource mastered by the dead node.

In the ocfs2/dlm implementation, this is done by the function
dlm_move_lockres_to_recovery_list, which marks those lock resources
as DLM_LOCK_RES_RECOVERING and manages them via a list from which
DLM later changes the lock resource's master.

So without invoking dlm_move_lockres_to_recovery_list, no master will
be chosen after dlm recovery completes, since no lock resource can
be found through ::resource list.

What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
lock resources mastered by a dead node, it will break synchronization
among nodes.

So invoke dlm_move_lockres_to_recovery_list again.

Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
lockres when recovery master goes down")'

Reported-by: Vitaly Mayatskih <v.mayatskih at gmail.com>
Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
---
 fs/ocfs2/dlm/dlmrecovery.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
index 74407c6..ec8f758 100644
--- a/fs/ocfs2/dlm/dlmrecovery.c
+++ b/fs/ocfs2/dlm/dlmrecovery.c
@@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct dlm_ctxt *dlm, u8 dead_node)
 					dlm_lockres_put(res);
 					continue;
 				}
+				dlm_move_lockres_to_recovery_list(dlm, res);
 			} else if (res->owner == dlm->node_num) {
 				dlm_free_dead_locks(dlm, res, dead_node);
 				__dlm_lockres_calc_usage(dlm, res);
-- 
1.7.9.5
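
For readers unfamiliar with the recovery path, the behaviour described in the commit message can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: every type, flag and function name prefixed with toy_ or TOY_ is hypothetical, the recovery list is reduced to a singly linked list, and the condition guarding the "put and continue" path (not visible in the diff context) is modelled by an assumed TOY_RES_BEING_PURGED flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the lock resource state bits. */
#define TOY_RES_RECOVERING   0x1u /* models DLM_LOCK_RES_RECOVERING */
#define TOY_RES_BEING_PURGED 0x2u /* assumed guard for the "put; continue" path */

struct toy_lockres {
	unsigned char owner;           /* node number of the current master */
	unsigned int state;            /* bit flags defined above */
	struct toy_lockres *next_reco; /* link in the toy recovery list */
};

struct toy_dlm {
	unsigned char node_num;        /* this node's number */
	struct toy_lockres *reco_list; /* head of the toy recovery list */
};

/*
 * Mark the resource as recovering and queue it so that a new master
 * can be chosen later. Returns 1 if newly queued, 0 if already queued.
 */
int toy_move_to_recovery_list(struct toy_dlm *dlm, struct toy_lockres *res)
{
	assert(dlm != NULL && res != NULL);
	if (res->state & TOY_RES_RECOVERING)
		return 0;
	res->state |= TOY_RES_RECOVERING;
	res->next_reco = dlm->reco_list;
	dlm->reco_list = res;
	return 1;
}

/*
 * Walk all known resources after dead_node has died and queue those
 * it mastered. Returns the number of resources queued for recovery.
 */
int toy_local_recovery_cleanup(struct toy_dlm *dlm, struct toy_lockres *all,
			       size_t n, unsigned char dead_node)
{
	int queued = 0;

	for (size_t i = 0; i < n; i++) {
		struct toy_lockres *res = &all[i];

		if (res->owner != dead_node)
			continue; /* mastered elsewhere, nothing to remaster */
		if (res->state & TOY_RES_BEING_PURGED)
			continue; /* being dropped, so it is not remastered */
		/* The step the patch restores: without it nothing is queued. */
		queued += toy_move_to_recovery_list(dlm, res);
	}
	return queued;
}

/* Scenario: node 2 dies; it mastered two resources, one of them being purged. */
int toy_demo(void)
{
	struct toy_dlm dlm = { .node_num = 1, .reco_list = NULL };
	struct toy_lockres res[3] = {
		{ .owner = 2, .state = 0 },
		{ .owner = 1, .state = 0 },
		{ .owner = 2, .state = TOY_RES_BEING_PURGED },
	};

	return toy_local_recovery_cleanup(&dlm, res, 3, 2);
}
```

In this model, removing the call to toy_move_to_recovery_list leaves the recovery list empty and the recovering flag unset for every resource the dead node mastered, which corresponds to the two symptoms the commit message describes: no new master is ever chosen, and other nodes cannot tell that the resource is still being recovered.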
piaojun
2017-Oct-18 08:17 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
Hi Changwei,

Could you share the method to reproduce the problem?

On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM later changes the lock resource's master.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih <v.mayatskih at gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct
> dlm_ctxt *dlm, u8 dead_node)
>  					dlm_lockres_put(res);
>  					continue;
>  				}
> +				dlm_move_lockres_to_recovery_list(dlm, res);
>  			} else if (res->owner == dlm->node_num) {
>  				dlm_free_dead_locks(dlm, res, dead_node);
>  				__dlm_lockres_calc_usage(dlm, res);
>
piaojun
2017-Oct-18 09:09 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
On 2017/10/17 14:48, Changwei Ge wrote:
> When a node dies, other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM later changes the lock resource's master.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>
> Reported-by: Vitaly Mayatskih <v.mayatskih at gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>

Reviewed-by: Jun Piao <piaojun at huawei.com>

> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct
> dlm_ctxt *dlm, u8 dead_node)
>  					dlm_lockres_put(res);
>  					continue;
>  				}
> +				dlm_move_lockres_to_recovery_list(dlm, res);
>  			} else if (res->owner == dlm->node_num) {
>  				dlm_free_dead_locks(dlm, res, dead_node);
>  				__dlm_lockres_calc_usage(dlm, res);
>
Joseph Qi
2017-Oct-23 03:51 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix cluster hang after a node dies
On 17/10/17 14:48, Changwei Ge wrote:
> When a node dies, other live nodes have to choose a new master
> for an existing lock resource mastered by the dead node.
>
> In the ocfs2/dlm implementation, this is done by the function
> dlm_move_lockres_to_recovery_list, which marks those lock resources
> as DLM_LOCK_RES_RECOVERING and manages them via a list from which
> DLM later changes the lock resource's master.
>
> So without invoking dlm_move_lockres_to_recovery_list, no master will
> be chosen after dlm recovery completes, since no lock resource can
> be found through ::resource list.
>
> What's worse is that if DLM_LOCK_RES_RECOVERING is not marked for
> lock resources mastered by a dead node, it will break synchronization
> among nodes.
>
> So invoke dlm_move_lockres_to_recovery_list again.
>
> Fixs: 'commit ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery
> lockres when recovery master goes down")'
>

There is a typo here; it should be:

Fixes: ee8f7fcbe638 ("ocfs2/dlm: continue to purge recovery lockres when recovery master goes down")

We should also Cc stable. Everything else looks good to me.

Reviewed-by: Joseph Qi <jiangqi903 at gmail.com>

> Reported-by: Vitaly Mayatskih <v.mayatskih at gmail.com>
> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
> ---
>  fs/ocfs2/dlm/dlmrecovery.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
> index 74407c6..ec8f758 100644
> --- a/fs/ocfs2/dlm/dlmrecovery.c
> +++ b/fs/ocfs2/dlm/dlmrecovery.c
> @@ -2419,6 +2419,7 @@ static void dlm_do_local_recovery_cleanup(struct
> dlm_ctxt *dlm, u8 dead_node)
>  					dlm_lockres_put(res);
>  					continue;
>  				}
> +				dlm_move_lockres_to_recovery_list(dlm, res);
>  			} else if (res->owner == dlm->node_num) {
>  				dlm_free_dead_locks(dlm, res, dead_node);
>  				__dlm_lockres_calc_usage(dlm, res);
>