piaojun
2019-Feb-14 09:04 UTC
[Ocfs2-devel] [PATCH] ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
Hi Changwei,

The problem can't be solved completely by clearing ::cancel_pending in
dlm_proxy_ast_handler, as the AST can arrive at any time just before
::cancel_pending is set. If there is no better solution, could we
accept this patch? This bug is very harmful.

Thanks,
Jun

On 2018/12/8 18:05, wangjian wrote:
> Hi Changwei,
>
> I understand your idea. But we should be aware that the cancel_convert process and
> the other processes (the process accepting the AST, the recovery process) are
> asynchronous. For example, following your idea, we check whether the lock is in the
> grant queue before calling the dlm_send_remote_unlock_request function in the
> dlm_proxy_ast_handler function, and then decide whether to clear cancel_pending. But
> if the AST has not arrived at that moment, the check passes and cancel_pending will
> not be cleared. Then the AST comes in immediately afterwards, which again leads to
> the bug. I personally think that for asynchronous processes we can't guarantee the
> execution speed of each process. All we can do is avoid the BUG scene.
>
> As for the question you raised ("If you remove the BUG check why you still call
> dlm_commit_pending_cancel() to move the lock back to grant no matter whether it's on
> the converting list or not?"): I think we should first check whether the lock is in
> the grant queue (at this point, dlm->spinlock and res->spinlock are already held),
> and then decide whether to call the dlm_commit_pending_cancel function.
>
> Thanks,
> Jian
>
> On 12/7/2018 11:12 AM, Changwei Ge wrote:
>> Hi Jian,
>>
>> I suppose that the situation you described truly exists.
>> But the way you fix the issue is not in my favor.
>>
>> If you remove the BUG check, why do you still call dlm_commit_pending_cancel() to
>> move the lock back to grant no matter whether it's on the converting list or not?
>>
>> So how about keeping the BUG check in dlm_move_lockres_to_recovery_list().
>> If the locking _ast_ comes back very fast, just check ::cancel_pending in
>> dlm_proxy_ast_handler() and clear it.
>> Then with the logic checking whether the lock is on the grant list (in
>> dlmunlock_common(), more or less like your current method) we can easily tell
>> whether the cancellation succeeded or not.
>>
>> That complies with the original dlm design, which I think is better and easier for
>> maintainers to understand.
>>
>> Thanks,
>> Changwei
>>
>> On 2018/12/6 20:06, wangjian wrote:
>>> Hi Changwei,
>>>
>>> I don't fully agree with your point of view. In my opinion, the lock cancellation
>>> process and the lock conversion process are asynchronous. We can't guarantee that
>>> the lock stays on the convert list during the lock conversion process, otherwise
>>> this BUG would not happen. So, I think this is a meaningless BUG.
>>>
>>> Thanks,
>>> Jian
>>>
>>> On 12/5/2018 9:49 AM, Changwei Ge wrote:
>>>> Hi Jian,
>>>>
>>>> I suppose you can't just remove the BUG_ON() check.
>>>> If you remove it, the code below violates the original logic.
>>>>
>>>> '''
>>>> 2141    dlm_commit_pending_cancel(res, lock);
>>>> '''
>>>>
>>>> What's more important is that a *locking cancel* must be against a *locking
>>>> conversion* in progress. So it makes sense to check whether this lock is on the
>>>> converting list.
>>>>
>>>> So I have to NACK this patch.
>>>>
>>>> Thanks,
>>>> Changwei
>>>>
>>>> On 2018/12/3 20:23, wangjian wrote:
>>>>> In the dlm_move_lockres_to_recovery_list function, if the lock
>>>>> is in the granted queue and cancel_pending is set, it will
>>>>> encounter a BUG. I think this is a meaningless BUG,
>>>>> so be prepared to remove it.
>>>>> A scenario that causes this BUG is given below.
>>>>>
>>>>> At the beginning, Node 1 is the master and holds an NL lock,
>>>>> Node 2 holds a PR lock, and Node 3 holds a PR lock too.
>>>>>
>>>>> Node 2: wants to get an EX lock.
>>>>> Node 3: wants to get an EX lock.
>>>>> Node 1: Node 3 is blocking Node 2 from getting the EX lock, so it sends
>>>>>         Node 3 a BAST.
>>>>> Node 3: receives the BAST from Node 1. The downconvert thread begins to
>>>>>         cancel the PR to EX conversion. In the dlmunlock_common function,
>>>>>         the downconvert thread has set lock->cancel_pending but has not
>>>>>         yet entered the dlm_send_remote_unlock_request function.
>>>>> Node 2: dies because the host is powered down.
>>>>> Node 1: in the recovery process, cleans up the locks related to Node 2,
>>>>>         then finishes Node 3's PR to EX request and sends Node 3 an AST.
>>>>> Node 3: receives the AST from Node 1, changes the lock level to EX and
>>>>>         moves the lock to the granted list.
>>>>> Node 1: dies because the host is powered down.
>>>>> Node 3: in the dlm_move_lockres_to_recovery_list function, the lock is in
>>>>>         the granted queue and cancel_pending is set. BUG_ON.
>>>>>
>>>>> But after removing this BUG, the process will encounter a second BUG in
>>>>> the ocfs2_unlock_ast function. Here is a scenario that will cause the
>>>>> second BUG in ocfs2_unlock_ast:
>>>>>
>>>>> At the beginning, Node 1 is the master and holds an NL lock,
>>>>> Node 2 holds a PR lock, and Node 3 holds a PR lock too.
>>>>>
>>>>> Node 2: wants to get an EX lock.
>>>>> Node 3: wants to get an EX lock.
>>>>> Node 1: Node 3 is blocking Node 2 from getting the EX lock, so it sends
>>>>>         Node 3 a BAST.
>>>>> Node 3: receives the BAST from Node 1. The downconvert thread begins to
>>>>>         cancel the PR to EX conversion. In the dlmunlock_common function,
>>>>>         the downconvert thread has released lock->spinlock and
>>>>>         res->spinlock but has not yet entered the
>>>>>         dlm_send_remote_unlock_request function.
>>>>> Node 2: dies because the host is powered down.
>>>>> Node 1: in the recovery process, cleans up the locks related to Node 2,
>>>>>         then finishes Node 3's PR to EX request and sends Node 3 an AST.
>>>>> Node 3: receives the AST from Node 1, changes the lock level to EX, moves
>>>>>         the lock to the granted list, and sets lockres->l_unlock_action
>>>>>         to OCFS2_UNLOCK_INVALID in the ocfs2_locking_ast function.
>>>>> Node 1: dies because the host is powered down.
>>>>> Node 3: realizes that Node 1 is dead and removes Node 1 from domain_map.
>>>>>         The downconvert thread gets DLM_NORMAL from the
>>>>>         dlm_send_remote_unlock_request function and sets *call_ast to 1.
>>>>>         Then the downconvert thread hits the BUG in the ocfs2_unlock_ast
>>>>>         function.
>>>>>
>>>>> To avoid hitting the second BUG, the dlmunlock_common function should
>>>>> return DLM_CANCELGRANT if the lock is on the granted list and the
>>>>> operation is a cancel.
>>>>>
>>>>> Signed-off-by: Jian Wang<wangjian161 at huawei.com>
>>>>> Reviewed-by: Yiwen Jiang<jiangyiwen at huawei.com>
>>>>> ---
>>>>>  fs/ocfs2/dlm/dlmrecovery.c | 1 -
>>>>>  fs/ocfs2/dlm/dlmunlock.c   | 5 +++++
>>>>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/fs/ocfs2/dlm/dlmrecovery.c b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> index 802636d..7489652 100644
>>>>> --- a/fs/ocfs2/dlm/dlmrecovery.c
>>>>> +++ b/fs/ocfs2/dlm/dlmrecovery.c
>>>>> @@ -2134,7 +2134,6 @@ void dlm_move_lockres_to_recovery_list(struct dlm_ctxt *dlm,
>>>>>  				 * if this had completed successfully
>>>>>  				 * before sending this lock state to the
>>>>>  				 * new master */
>>>>> -				BUG_ON(i != DLM_CONVERTING_LIST);
>>>>>  				mlog(0, "node died with cancel pending "
>>>>>  				     "on %.*s. move back to granted list.\n",
>>>>>  				     res->lockname.len, res->lockname.name);
>>>>> diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
>>>>> index 63d701c..505bb6c 100644
>>>>> --- a/fs/ocfs2/dlm/dlmunlock.c
>>>>> +++ b/fs/ocfs2/dlm/dlmunlock.c
>>>>> @@ -183,6 +183,11 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
>>>>>  					  flags, owner);
>>>>>  	spin_lock(&res->spinlock);
>>>>>  	spin_lock(&lock->spinlock);
>>>>> +
>>>>> +	if ((flags & LKM_CANCEL) &&
>>>>> +	    dlm_lock_on_list(&res->granted, lock))
>>>>> +		status = DLM_CANCELGRANT;
>>>>> +
>>>>>  	/* if the master told us the lock was already granted,
>>>>>  	 * let the ast handle all of these actions */
>>>>>  	if (status == DLM_CANCELGRANT) {
>>>>> --
>>>>> 1.8.3.1
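The interleaving in the quoted patch description can be condensed into a small
user-space sketch. It only illustrates the event ordering on Node 3 in the scenario
above; the fake_lock structure and names below are hypothetical stand-ins, not the
kernel's dlm structures.

/*
 * Illustrative sketch only: the ordering on Node 3 that trips
 * BUG_ON(i != DLM_CONVERTING_LIST) in dlm_move_lockres_to_recovery_list().
 * fake_lock is a hypothetical stand-in, not the real dlm lock.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum list_id { GRANTED_LIST, CONVERTING_LIST };

struct fake_lock {
	enum list_id on_list;   /* which queue the lock currently sits on */
	bool cancel_pending;    /* set by the downconvert thread's cancel */
};

int main(void)
{
	struct fake_lock lock = { .on_list = CONVERTING_LIST };

	/* 1. The downconvert thread starts cancelling the PR->EX convert
	 *    (dlmunlock_common) and marks the cancel as pending, but has not
	 *    yet sent the remote unlock request. */
	lock.cancel_pending = true;

	/* 2. The AST from the master (Node 1) arrives first: the convert is
	 *    granted and the lock moves to the granted queue. */
	lock.on_list = GRANTED_LIST;

	/* 3. Node 1 dies.  Recovery walks the queues and finds a lock with
	 *    cancel_pending set that is no longer on the converting list,
	 *    which is exactly the condition the BUG_ON rejects. */
	if (lock.cancel_pending)
		assert(lock.on_list == CONVERTING_LIST);  /* fires here */

	printf("not reached in this interleaving\n");
	return 0;
}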
Changwei Ge
2019-Feb-14 09:59 UTC
[Ocfs2-devel] [PATCH] ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
Hi Jun,

On 2019/2/14 17:09, piaojun wrote:
> Hi Changwei,
>
> The problem can't be solved completely by clearing ::cancel_pending in
> dlm_proxy_ast_handler, as the AST can arrive at any time just before
> ::cancel_pending is set. If there is no better solution, could we
> accept this patch? This bug is very harmful.

Um, I am recalling the problem this patch is trying to address...
If I miss something, please let me know.

I read the patch again; it seems that NODE3 finds a lock on its _grant list_ with
::cancel_pending set, which makes the dlm insane.

Actually, I don't like patches that address problems by killing BUG checks, as those
checks keep the whole software consistent with its original design. With those BUG
checks, we can add features and fix bugs in an elegant way.

As the lock is *already granted*, why not just check ::cancel_pending and, if it is
set, clear it and then move the lock to the grant list, since the cancellation
obviously can't work anymore?

And I think we don't even have to worry about dlm_send_remote_unlock_request(), since
the message send will fail anyway, but the function will return NORMAL as NODE1 was
*dead* already.

I admit that the problem reported truly exists and should be solved, but the method is
not in my favor. :-) I have several reasons for your reference:

1) My biggest concern is that we should not remove the BUG check
   ( BUG_ON(i != DLM_CONVERTING_LIST) ), because doing so might violate the original
   ocfs2/dlm design and make the following maintenance work hard.

2) As for another patch from Jian, it can't use the LVB to boost the performance of
   that single locking procedure.

I am providing draft code for you and Jian's reference.

root at ubuntu16:/home/chge/linux[master]# git diff
diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
index 39831fc..812843b 100644
--- a/fs/ocfs2/dlm/dlmast.c
+++ b/fs/ocfs2/dlm/dlmast.c
@@ -372,8 +372,11 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data,
 		head = &res->converting;
 		lock = NULL;
 		list_for_each_entry(lock, head, list) {
-			if (lock->ml.cookie == cookie)
+			if (lock->ml.cookie == cookie) {
+				if (lock->cancel_pending)
+					lock->cancel_pending = 0;
 				goto do_ast;
+			}
 		}

Thanks,
Changwei

> Thanks,
> Jun
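As a reading aid for the dlmast.c hunk above, the sketch below walks, in plain
user-space C, the ordering that clearing ::cancel_pending in the AST handler covers,
and the ordering raised in the first mail on this page that the hunk on its own leaves
open. The fake_lock structure and helper names are hypothetical stand-ins, not the
kernel code.

#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dlm lock state on Node 3. */
struct fake_lock {
	bool granted;          /* lock sits on the granted queue */
	bool cancel_pending;   /* a convert cancel has been started */
};

/* Rough model of the proposed dlm_proxy_ast_handler() change: drop any
 * pending cancel while granting the convert. */
static void convert_ast_grants(struct fake_lock *l)
{
	if (l->cancel_pending)
		l->cancel_pending = false;
	l->granted = true;
}

int main(void)
{
	/* Ordering the hunk covers: the cancel starts first, the AST lands
	 * later and clears the flag. */
	struct fake_lock a = { false, false };
	a.cancel_pending = true;      /* downconvert thread starts the cancel */
	convert_ast_grants(&a);       /* AST clears the flag while granting   */
	assert(!(a.granted && a.cancel_pending));

	/* Ordering still open (the concern in the first mail on this page):
	 * the AST lands first, the cancel starts only afterwards. */
	struct fake_lock b = { false, false };
	convert_ast_grants(&b);       /* nothing to clear yet */
	b.cancel_pending = true;      /* flag now set on a granted lock */
	assert(b.granted && b.cancel_pending);  /* the state recovery BUGs on */

	return 0;
}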
Changwei Ge
2019-Feb-14 10:13 UTC
[Ocfs2-devel] [PATCH] ocfs2/dlm: return DLM_CANCELGRANT if the lock is on granted list and the operation is canceled
On 2019/2/14 17:09, piaojun wrote:
> Hi Changwei,
>
> The problem can't be solved completely by clearing ::cancel_pending in
> dlm_proxy_ast_handler, as the AST can arrive at any time just before

So how about also adding a check before setting ::cancel_pending in dlmunlock_common(),
i.e. before invoking dlm_send_remote_unlock_request(): if the lock is already on the
grant list, just return DLM_CANCELGRANT.

Then a further reference code might look like:

root at ubuntu16:/home/chge/linux[master]# git diff
diff --git a/fs/ocfs2/dlm/dlmast.c b/fs/ocfs2/dlm/dlmast.c
index 39831fc..812843b 100644
--- a/fs/ocfs2/dlm/dlmast.c
+++ b/fs/ocfs2/dlm/dlmast.c
@@ -372,8 +372,11 @@ int dlm_proxy_ast_handler(struct o2net_msg *msg, u32 len, void *data,
 		head = &res->converting;
 		lock = NULL;
 		list_for_each_entry(lock, head, list) {
-			if (lock->ml.cookie == cookie)
+			if (lock->ml.cookie == cookie) {
+				if (lock->cancel_pending)
+					lock->cancel_pending = 0;
 				goto do_ast;
+			}
 		}

 	/* if not on convert, try blocked for ast, granted for bast */
diff --git a/fs/ocfs2/dlm/dlmunlock.c b/fs/ocfs2/dlm/dlmunlock.c
index c8e9b70..b4728b5 100644
--- a/fs/ocfs2/dlm/dlmunlock.c
+++ b/fs/ocfs2/dlm/dlmunlock.c
@@ -174,9 +174,14 @@ static enum dlm_status dlmunlock_common(struct dlm_ctxt *dlm,
 	if (!master_node) {
 		owner = res->owner;
 		/* drop locks and send message */
-		if (flags & LKM_CANCEL)
+		if (flags & LKM_CANCEL) {
+			if (dlm_lock_on_list(&res->granted, lock)) {
+				status = DLM_CANCELGRANT;
+				goto leave;
+			}
+
 			lock->cancel_pending = 1;
-		else
+		} else
 			lock->unlock_pending = 1;
 		spin_unlock(&lock->spinlock);
 		spin_unlock(&res->spinlock);

Thanks,
Changwei

> ::cancel_pending is set. If there is no better solution, could we
> accept this patch? This bug is very harmful.
>
> Thanks,
> Jun
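For reference, here is a hedged user-space sketch of how the two hunks above are meant
to cooperate: the dlmunlock_common() guard refuses to start a cancel once the lock is
already granted (returning DLM_CANCELGRANT), while the dlm_proxy_ast_handler() change
clears a cancel that started before the grant arrived. The structure and helper names
are hypothetical stand-ins rather than the real dlm code.

#include <assert.h>
#include <stdbool.h>

enum list_id { GRANTED_LIST, CONVERTING_LIST };

/* Hypothetical stand-in for the dlm lock state on Node 3. */
struct fake_lock {
	enum list_id on_list;
	bool cancel_pending;
};

/* Rough model of the proposed dlmunlock_common() guard: refuse to start a
 * cancel once the convert has already been granted. */
static bool try_start_cancel(struct fake_lock *l)
{
	if (l->on_list == GRANTED_LIST)
		return false;            /* would return DLM_CANCELGRANT */
	l->cancel_pending = true;
	return true;
}

/* Rough model of the proposed dlm_proxy_ast_handler() change: clear any
 * pending cancel before moving the lock to the granted queue. */
static void convert_ast_grants(struct fake_lock *l)
{
	if (l->cancel_pending)
		l->cancel_pending = false;
	l->on_list = GRANTED_LIST;
}

int main(void)
{
	/* Ordering A: the cancel starts first, the AST arrives later. */
	struct fake_lock a = { CONVERTING_LIST, false };
	assert(try_start_cancel(&a));          /* cancel_pending set  */
	convert_ast_grants(&a);                /* AST clears it again */
	assert(!(a.on_list == GRANTED_LIST && a.cancel_pending));

	/* Ordering B: the AST arrives first, the cancel is attempted later. */
	struct fake_lock b = { CONVERTING_LIST, false };
	convert_ast_grants(&b);                /* lock already granted */
	assert(!try_start_cancel(&b));         /* refused: DLM_CANCELGRANT */
	assert(!(b.on_list == GRANTED_LIST && b.cancel_pending));

	/* In neither ordering does recovery see a granted lock that still
	 * carries cancel_pending. */
	return 0;
}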