Changwei Ge
2017-Nov-01 09:00 UTC
[Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres
Hi Jun,

On 2017/11/1 16:46, piaojun wrote:
> Hi Changwei,
>
> I do think we need to follow the principle that 'dlm_domain_lock'
> protects 'dlm_state', as the NOTE in 'dlm_ctxt' says:
> /* NOTE: Next three are protected by dlm_domain_lock */
>
> The dead node won't be cleared from 'dlm->domain_map' if we return
> early from __dlm_hb_node_down(), and NodeA will retry migrating to
> NodeB forever if only NodeA and NodeB are in the domain. I suggest
> more testing of your solution.

I agree; however, my patch is just a draft to illustrate my comments.

Perhaps we can figure out a better way to solve this, as your patch
can't clear the DLM RECOVERING flag on the lock resource. I am not sure
whether that is reasonable; I suppose it may violate the ocfs2/dlm
design philosophy.

Thanks,
Changwei

> thanks,
> Jun
>
> On 2017/11/1 16:11, Changwei Ge wrote:
>> Hi Jun,
>>
>> Thanks for reviewing.
>> I don't think we have to worry about misusing *dlm_domain_lock* and
>> *dlm::spinlock*. I admit my change may look a little different from
>> most other code snippets that use these two spinlocks, but our purpose
>> is to close the race between __dlm_hb_node_down() and
>> dlm_unregister_domain(), right? And my change meets that. :-)
>>
>> I suppose we can do it in a flexible way.
>>
>> Thanks,
>> Changwei
>>
>> On 2017/11/1 15:57, piaojun wrote:
>>> Hi Changwei,
>>>
>>> Thanks for reviewing. I think waiting for recovery to finish before
>>> migrating seems like another solution, but I wonder whether it will
>>> introduce new problems; see my comments below.
>>>
>>> thanks,
>>> Jun
>>>
>>> On 2017/11/1 15:13, Changwei Ge wrote:
>>>> Hi Jun,
>>>>
>>>> I probably get your point.
>>>>
>>>> You mean that dlm finds no lock resource to be migrated and no more
>>>> lock resources are managed by its hash table. After that, a node
>>>> dies all of a sudden and the dead node is put into dlm's recovery
>>>> map, right?
>>> That is it.
>>>> Furthermore, a lock resource is migrated from other nodes and the
>>>> local node has already asserted master to them.
>>>>
>>>> If so, I want to suggest an easier way to solve it.
>>>> We don't have to add a new flag to the dlm structure; we can just
>>>> leverage the existing dlm status and bitmap.
>>>> It brings a bonus: no lock resource will be marked RECOVERING,
>>>> which is safer, I suppose.
>>>>
>>>> Please take a review.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>> Subject: [PATCH] ocfs2/dlm: a node can't be involved in recovery if
>>>> it is being shut down
>>>>
>>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>>> ---
>>>>   fs/ocfs2/dlm/dlmdomain.c   | 4 ++++
>>>>   fs/ocfs2/dlm/dlmrecovery.c | 3 +++
>>>>   2 files changed, 7 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>> index a2b19fbdcf46..5e9283e509a4 100644
>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>> @@ -707,11 +707,15 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>>>   		 * want new domain joins to communicate with us at
>>>>   		 * least until we've completed migration of our
>>>>   		 * resources. */
>>>> +		spin_lock(&dlm->spinlock);
>>>>   		dlm->dlm_state = DLM_CTXT_IN_SHUTDOWN;
>>>> +		spin_unlock(&dlm->spinlock);
>>> I guess this is a misuse of 'dlm->spinlock' where 'dlm_domain_lock'
>>> should be used.
>>>>   		leave = 1;
>>>>   	}
>>>>   	spin_unlock(&dlm_domain_lock);
>>>>
>>>> +	dlm_wait_for_recovery(dlm);
>>>> +
>>>>   	if (leave) {
>>>>   		mlog(0, "shutting down domain %s\n", dlm->name);
>>>>   		dlm_begin_exit_domain(dlm);
>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>> index 74407c6dd592..764c95b2b35c 100644
>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>> @@ -2441,6 +2441,9 @@ static void __dlm_hb_node_down(struct dlm_ctxt *dlm, int idx)
>>>>   {
>>>>   	assert_spin_locked(&dlm->spinlock);
>>>>
>>>> +	if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
>>>> +		return;
>>>> +
>>> 'dlm->dlm_state' probably needs to be protected by 'dlm_domain_lock',
>>> and I wonder whether more work needs to be done when in
>>> 'DLM_CTXT_IN_SHUTDOWN'.
>>>>   	if (dlm->reco.new_master == idx) {
>>>>   		mlog(0, "%s: recovery master %d just died\n",
>>>>   		     dlm->name, idx);
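For reference: dlm_wait_for_recovery(), which the draft patch calls from
dlm_unregister_domain(), blocks until the domain leaves active recovery.
A minimal sketch of that mechanism, paraphrased from the helpers in
fs/ocfs2/dlm/dlmrecovery.c (the mainline bodies differ slightly, e.g.
they also emit mlog output, so treat this as approximate):

/* Paraphrased sketch of the recovery-wait helpers in
 * fs/ocfs2/dlm/dlmrecovery.c; details approximate. */

int dlm_in_recovery(struct dlm_ctxt *dlm)
{
	int in_recovery;

	spin_lock(&dlm->spinlock);
	in_recovery = !!(dlm->reco.state & DLM_RECO_STATE_ACTIVE);
	spin_unlock(&dlm->spinlock);

	return in_recovery;
}

void dlm_wait_for_recovery(struct dlm_ctxt *dlm)
{
	/* Sleep until the recovery thread clears DLM_RECO_STATE_ACTIVE
	 * and wakes dlm->reco.event. */
	wait_event(dlm->reco.event, !dlm_in_recovery(dlm));
}

This is also why the draft places the call after the dlm_domain_lock
critical section: wait_event() can sleep, so it must not run inside a
spinlock-protected region.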
piaojun
2017-Nov-02 01:42 UTC
[Ocfs2-devel] [PATCH] ocfs2/dlm: wait for dlm recovery done when migrating all lockres
Hi Changwei,

I had tried a solution like yours before, but failed to prevent the
race using just 'dlm_state' and the existing variables, since
'DLM_CTXT_IN_SHUTDOWN' covers many states. So I think we need to
introduce a 'migrate_done' flag to solve this problem.

thanks,
Jun

On 2017/11/1 17:00, Changwei Ge wrote:
> Hi Jun,
>
> On 2017/11/1 16:46, piaojun wrote:
>> Hi Changwei,
>>
>> I do think we need to follow the principle that 'dlm_domain_lock'
>> protects 'dlm_state', as the NOTE in 'dlm_ctxt' says:
>> /* NOTE: Next three are protected by dlm_domain_lock */
>>
>> The dead node won't be cleared from 'dlm->domain_map' if we return
>> early from __dlm_hb_node_down(), and NodeA will retry migrating to
>> NodeB forever if only NodeA and NodeB are in the domain. I suggest
>> more testing of your solution.
>
> I agree; however, my patch is just a draft to illustrate my comments.
>
> Perhaps we can figure out a better way to solve this, as your patch
> can't clear the DLM RECOVERING flag on the lock resource. I am not
> sure whether that is reasonable; I suppose it may violate the
> ocfs2/dlm design philosophy.
>
> Thanks,
> Changwei

If the lock resource is marked DLM RECOVERING, the migrating process
will wait for recovery to finish, and DLM RECOVERING will be cleared
after recovery completes.

>> thanks,
>> Jun
>>
>> On 2017/11/1 16:11, Changwei Ge wrote:
>>> Hi Jun,
>>>
>>> Thanks for reviewing.
>>> I don't think we have to worry about misusing *dlm_domain_lock* and
>>> *dlm::spinlock*. I admit my change may look a little different from
>>> most other code snippets that use these two spinlocks, but our
>>> purpose is to close the race between __dlm_hb_node_down() and
>>> dlm_unregister_domain(), right? And my change meets that. :-)
>>>
>>> I suppose we can do it in a flexible way.
>>>
>>> Thanks,
>>> Changwei
>>>
>>> On 2017/11/1 15:57, piaojun wrote:
>>>> Hi Changwei,
>>>>
>>>> Thanks for reviewing. I think waiting for recovery to finish before
>>>> migrating seems like another solution, but I wonder whether it will
>>>> introduce new problems; see my comments below.
>>>>
>>>> thanks,
>>>> Jun
>>>>
>>>> On 2017/11/1 15:13, Changwei Ge wrote:
>>>>> Hi Jun,
>>>>>
>>>>> I probably get your point.
>>>>>
>>>>> You mean that dlm finds no lock resource to be migrated and no
>>>>> more lock resources are managed by its hash table. After that, a
>>>>> node dies all of a sudden and the dead node is put into dlm's
>>>>> recovery map, right?
>>>> That is it.
>>>>> Furthermore, a lock resource is migrated from other nodes and the
>>>>> local node has already asserted master to them.
>>>>>
>>>>> If so, I want to suggest an easier way to solve it.
>>>>> We don't have to add a new flag to the dlm structure; we can just
>>>>> leverage the existing dlm status and bitmap.
>>>>> It brings a bonus: no lock resource will be marked RECOVERING,
>>>>> which is safer, I suppose.
>>>>>
>>>>> Please take a review.
>>>>>
>>>>> Thanks,
>>>>> Changwei
>>>>>
>>>>> Subject: [PATCH] ocfs2/dlm: a node can't be involved in recovery
>>>>> if it is being shut down
>>>>>
>>>>> Signed-off-by: Changwei Ge <ge.changwei at h3c.com>
>>>>> ---
>>>>>   fs/ocfs2/dlm/dlmdomain.c   | 4 ++++
>>>>>   fs/ocfs2/dlm/dlmrecovery.c | 3 +++
>>>>>   2 files changed, 7 insertions(+)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmdomain.c b/fs/ocfs2/dlm/dlmdomain.c
>>>>> index a2b19fbdcf46..5e9283e509a4 100644
>>>>> --- a/fs/ocfs2/dlm/dlmdomain.c
>>>>> +++ b/fs/ocfs2/dlm/dlmdomain.c
>>>>> @@ -707,11 +707,15 @@ void dlm_unregister_domain(struct dlm_ctxt *dlm)
>>>>>   		 * want new domain joins to communicate with us at
>>>>>   		 * least until we've completed migration of our
>>>>>   		 * resources. */
>>>>> +		spin_lock(&dlm->spinlock);
>>>>>   		dlm->dlm_state = DLM_CTXT_IN_SHUTDOWN;
>>>>> +		spin_unlock(&dlm->spinlock);
>>>> I guess this is a misuse of 'dlm->spinlock' where 'dlm_domain_lock'
>>>> should be used.
>>>>>   		leave = 1;
>>>>>   	}
>>>>>   	spin_unlock(&dlm_domain_lock);
>>>>>
>>>>> +	dlm_wait_for_recovery(dlm);
>>>>> +
>>>>>   	if (leave) {
>>>>>   		mlog(0, "shutting down domain %s\n", dlm->name);
>>>>>   		dlm_begin_exit_domain(dlm);
>>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> index 74407c6dd592..764c95b2b35c 100644
>>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> @@ -2441,6 +2441,9 @@ static void __dlm_hb_node_down(struct dlm_ctxt *dlm, int idx)
>>>>>   {
>>>>>   	assert_spin_locked(&dlm->spinlock);
>>>>>
>>>>> +	if (dlm->dlm_state == DLM_CTXT_IN_SHUTDOWN)
>>>>> +		return;
>>>>> +
>>>> 'dlm->dlm_state' probably needs to be protected by 'dlm_domain_lock',
>>>> and I wonder whether more work needs to be done when in
>>>> 'DLM_CTXT_IN_SHUTDOWN'.
>>>>>   	if (dlm->reco.new_master == idx) {
>>>>>   		mlog(0, "%s: recovery master %d just died\n",
>>>>>   		     dlm->name, idx);
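A rough sketch of how the 'migrate_done' flag Jun proposes might be
wired up, for readers following the thread. The flag name comes from
Jun's message; where it is set and checked below, and the surrounding
fragments (num, ret), are assumptions drawn from this discussion, not a
quote of any merged patch:

/* Hypothetical sketch of the 'migrate_done' idea; placement and
 * details are assumptions based on this thread, not a final patch. */

/* 1. New field in struct dlm_ctxt (fs/ocfs2/dlm/dlmcommon.h),
 *    initialized to 0 when the domain context is allocated: */
	int migrate_done;

/* 2. In dlm_migrate_all_locks() (fs/ocfs2/dlm/dlmdomain.c), once the
 *    resource hash table has been emptied (num == 0): */
	if (!num) {
		if (dlm->reco.state & DLM_RECO_STATE_ACTIVE) {
			/* Recovery in flight may hand this node more
			 * lock resources, so migration is not done. */
			ret = -EAGAIN;
		} else {
			/* Nothing left to migrate and no recovery
			 * pending: this node need not recover anyone. */
			dlm->migrate_done = 1;
		}
	}

/* 3. In dlm_do_recovery() (fs/ocfs2/dlm/dlmrecovery.c), under
 *    dlm->spinlock, before picking a dead node to recover: */
	if (dlm->migrate_done) {
		spin_unlock(&dlm->spinlock);
		return 0;
	}

Unlike the early return in __dlm_hb_node_down(), this leaves the
domain-map bookkeeping intact when a node dies, which sidesteps the
NodeA/NodeB endless-retry concern raised earlier in the thread while
still keeping a node that has migrated everything away out of recovery.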