alex chen

2017-Dec-28 02:02 UTC

[Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

Hi Gang,

On 2017/12/27 18:37, Gang He wrote:
> Hi Jun,
> 
> 
>>>>
>> Hi Gang,
>>
>> Do you mean that too many retries in the loop cost lots of CPU time and
>> block the page-fault interrupt? We should not add any delay in
>> ocfs2_fault(), right? And I still feel a little confused about why your
>> method can solve this problem.
> You can see the related code in the function filemap_fault(): if ocfs2
> fails to read a page because it cannot get an inode lock in non-blocking
> mode, the VFS layer code will invoke the ocfs2 read-page callback function
> in a loop, which leads to a softlockup problem (like the back trace below).
> So, we should take a blocking lock to let this node get the DLM lock,
> which also avoids the CPU loop.

Can we use 'cond_resched()' to allow the thread to release the CPU
temporarily to solve this softlockup?

> Second, based on my testing, the patch can also improve efficiency when
> the same file is modified frequently from multiple nodes, since the chance
> of acquiring the lock is fairer.
> In fact, the code was modified by patch 1cce4df04f37 ("ocfs2: do not
> lock/unlock() inode DLM lock"). Before that patch the code was the same, so
> this patch can be considered a revert of that patch, apart from adding
> clearer comments.

In patch 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock"),
Goldwyn says that a blocking lock and unlock will only make performance
worse where contention over the locks is high, which is the opposite of
what you described above.
IMO, the blocking lock and unlock here is indeed unnecessary.

Thanks,
Alex

>
> Thanks
> Gang
> 
> 
>>
>> thanks,
>> Jun
>>
>> On 2017/12/27 17:29, Gang He wrote:
>>> If we can't get inode lock immediately in the function
>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>> return directly here, since this will lead to a softlockup problem.
>>> The method is to get a blocking lock and immediately unlock before
>>> returning, this can avoid CPU resource waste due to lots of retries,
>>> and benefits fairness in getting lock among multiple nodes, increase
>>> efficiency in case modifying the same file frequently from multiple
>>> nodes.
>>> The softlockup problem looks like,
>>> Kernel panic - not syncing: softlockup: hung tasks
>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>> Call Trace:
>>>   <IRQ>
>>>   dump_stack+0x5c/0x82
>>>   panic+0xd5/0x21e
>>>   watchdog_timer_fn+0x208/0x210
>>>   ? watchdog_park_threads+0x70/0x70
>>>   __hrtimer_run_queues+0xcc/0x200
>>>   hrtimer_interrupt+0xa6/0x1f0
>>>   smp_apic_timer_interrupt+0x34/0x50
>>>   apic_timer_interrupt+0x96/0xa0
>>>   </IRQ>
>>>  RIP: 0010:unlock_page+0x17/0x30
>>>  RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>>>  RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
>>>  RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
>>>  RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
>>>  R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
>>>  R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
>>>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>   ? pagecache_get_page+0x30/0x200
>>>   filemap_fault+0x12b/0x5c0
>>>   ? recalc_sigpending+0x17/0x50
>>>   ? __set_task_blocked+0x28/0x70
>>>   ? __set_current_blocked+0x3d/0x60
>>>   ocfs2_fault+0x29/0xb0 [ocfs2]
>>>   __do_fault+0x1a/0xa0
>>>   __handle_mm_fault+0xbe8/0x1090
>>>   handle_mm_fault+0xaa/0x1f0
>>>   __do_page_fault+0x235/0x4b0
>>>   trace_do_page_fault+0x3c/0x110
>>>   async_page_fault+0x28/0x30
>>>  RIP: 0033:0x7fa75ded638e
>>>  RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
>>>  RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
>>>  RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
>>>  RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
>>>  R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
>>>  R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
>>>
>>> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
>>> Signed-off-by: Gang He <ghe at suse.com>
>>> ---
>>>  fs/ocfs2/dlmglue.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>> index 4689940..5193218 100644
>>> --- a/fs/ocfs2/dlmglue.c
>>> +++ b/fs/ocfs2/dlmglue.c
>>> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>  	if (ret == -EAGAIN) {
>>>  		unlock_page(page);
>>> +		/*
>>> +		 * If we can't get inode lock immediately, we should not return
>>> +		 * directly here, since this will lead to a softlockup problem.
>>> +		 * The method is to get a blocking lock and immediately unlock
>>> +		 * before returning, this can avoid CPU resource waste due to
>>> +		 * lots of retries, and benefits fairness in getting lock.
>>> +		 */
>>> +		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>> +			ocfs2_inode_unlock(inode, ex);
>>>  		ret = AOP_TRUNCATED_PAGE;
>>>  	}
>>>  
>>>
> 
> _______________________________________________
> Ocfs2-devel mailing list
> Ocfs2-devel at oss.oracle.com
> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
> 
> .
>
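The control flow argued over in this thread can be illustrated with a small userspace model in C. This is only a sketch, not kernel code: `model_lock`, `model_trylock()`, `model_lock_blocking()`, `model_lock_with_page()` and `model_read_page()` are hypothetical names, the remote holder is reduced to a tick counter, and the model assumes no other node re-takes the lock after the blocking lock/unlock. It counts how many times the read-page callback is invoked with and without the extra blocking step.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the callback's return codes. */
enum { MODEL_OK = 0, MODEL_TRUNCATED_PAGE = 1 };

/* Toy cluster lock: a remote node holds it for busy_ticks more ticks. */
struct model_lock {
	int busy_ticks;
};

/* Non-blocking attempt: fails (and burns one tick) while the lock is busy. */
static bool model_trylock(struct model_lock *l)
{
	if (l->busy_ticks == 0)
		return true;
	l->busy_ticks--;
	return false;
}

/* Blocking attempt: the caller sleeps until the remote holder is done. */
static void model_lock_blocking(struct model_lock *l)
{
	l->busy_ticks = 0;
}

/* Models the shape of the patched function: try non-blocking first and,
 * on contention, optionally take and drop a blocking lock before asking
 * the caller to retry. */
static int model_lock_with_page(struct model_lock *l, bool block_on_contention)
{
	if (model_trylock(l))
		return MODEL_OK;
	/* The page would be unlocked here. */
	if (block_on_contention)
		model_lock_blocking(l);	/* lock, then immediately unlock */
	return MODEL_TRUNCATED_PAGE;
}

/* Stands in for the caller's retry loop; returns the number of
 * read-page callback invocations needed to get the lock. */
static int model_read_page(int busy_ticks, bool block_on_contention)
{
	struct model_lock l = { .busy_ticks = busy_ticks };
	int calls = 0;

	assert(busy_ticks >= 0);
	do {
		calls++;
	} while (model_lock_with_page(&l, block_on_contention) ==
		 MODEL_TRUNCATED_PAGE);
	return calls;
}
```

In this model, a lock that stays busy for 1000 ticks costs 1001 callback invocations without the blocking step and 2 with it. The model says nothing about the cost of the blocking lock itself under heavy contention, which is the concern raised against the patch.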

Gang He

2017-Dec-28 02:48 UTC

[Ocfs2-devel] [PATCH] ocfs2: try a blocking lock before return AOP_TRUNCATED_PAGE

Hi Alex,

>>> 
> Hi Gang,
> 
> On 2017/12/27 18:37, Gang He wrote:
>> Hi Jun,
>> 
>> 
>>>>>
>>> Hi Gang,
>>>
>>> Do you mean that too many retries in the loop cost lots of CPU time and
>>> block the page-fault interrupt? We should not add any delay in
>>> ocfs2_fault(), right? And I still feel a little confused about why your
>>> method can solve this problem.
>> You can see the related code in the function filemap_fault(): if ocfs2
>> fails to read a page because it cannot get an inode lock in non-blocking
>> mode, the VFS layer code will invoke the ocfs2 read-page callback function
>> in a loop, which leads to a softlockup problem (like the back trace below).
>> So, we should take a blocking lock to let this node get the DLM lock,
>> which also avoids the CPU loop.
> Can we use 'cond_resched()' to allow the thread to release the CPU
> temporarily to solve this softlockup?

Yes, we can use the cond_resched() function to avoid this softlockup.
In fact, if the kernel is configured with CONFIG_PREEMPT=y, this softlockup
does not happen, since the kernel can preempt the looping task.
But this approach still wastes CPU resources: CPU usage can reach about
80%-100% when multiple nodes read/write/mmap-access the same file
concurrently, and moreover the read/write/mmap-access speed is much lower
(a 50% decrease).
Why?
Because each node needs to get the DLM lock. Before one node can get the DLM
lock, another node has to down-convert it, which means flushing the in-memory
data to disk before the DLM lock down-conversion.
This disk I/O is very slow compared with CPU cycles, so the node that wants
to get the DLM lock will do lots of retries before the other node completes
down-converting the lock. These retries do not make sense; they just waste
CPU cycles.
So, if we add a blocking lock/unlock here, we avoid these unnecessary
retries, especially with slow disks and more ocfs2 nodes (>=3).
I ran the ocfs2 test case (multi_mmap in multiple_run.sh). After applying
this patch, the CPU usage on each node was about 40%-50%, and the test case
execution time was reduced by half.
The full command is as below:
multiple_run.sh -i eth0 -k ~/linux-4.4.21-69.tar.gz -o ~/ocfs2mullog -C
hacluster -s pcmk -n nd1,nd2,nd3 -d /dev/sda1 -b 4096 -c 32768 -t multi_mmap
/mnt/shared
The shared storage is an iSCSI disk.

Thanks
Gang
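
The point made above, that yielding the CPU between retries avoids the softlockup but does not reduce the number of wasted retries, can be sketched with a toy model in C. This is a userspace illustration under stated assumptions, not kernel code: `spin_stats` and `model_spin()` are hypothetical names, the remote down-conversion is reduced to a tick counter, and resetting `run` merely stands in for a call such as cond_resched().

```c
#include <assert.h>
#include <stdbool.h>

struct spin_stats {
	long attempts;		/* non-blocking lock attempts made */
	long longest_run;	/* most consecutive attempts without yielding */
};

/* Retry a non-blocking lock until the remote holder, busy for busy_ticks
 * more ticks, has finished down-converting. */
static struct spin_stats model_spin(long busy_ticks, bool yield_between)
{
	struct spin_stats s = { 0, 0 };
	long run = 0;

	assert(busy_ticks >= 0);
	for (;;) {
		s.attempts++;
		run++;
		if (run > s.longest_run)
			s.longest_run = run;
		if (busy_ticks == 0)
			break;		/* lock granted */
		busy_ticks--;		/* remote node makes progress */
		if (yield_between)
			run = 0;	/* stands in for cond_resched() */
	}
	return s;
}
```

With a lock that stays busy for 500 ticks, both variants make 501 attempts. Only the longest uninterrupted run differs: 501 without yielding and 1 with it. That matches the argument that yielding keeps the watchdog quiet while the retries themselves remain wasted work.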
> 
>> Second, based on my testing, the patch can also improve efficiency when
>> the same file is modified frequently from multiple nodes, since the chance
>> of acquiring the lock is fairer.
>> In fact, the code was modified by patch 1cce4df04f37 ("ocfs2: do not
>> lock/unlock() inode DLM lock"). Before that patch the code was the same, so
>> this patch can be considered a revert of that patch, apart from adding
>> clearer comments.
> In patch 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock"),
> Goldwyn says that a blocking lock and unlock will only make performance
> worse where contention over the locks is high, which is the opposite of
> what you described above.
> IMO, the blocking lock and unlock here is indeed unnecessary.
> 
> Thanks,
> Alex
>>  
>> Thanks
>> Gang
>> 
>> 
>>>
>>> thanks,
>>> Jun
>>>
>>> On 2017/12/27 17:29, Gang He wrote:
>>>> If we can't get inode lock immediately in the function
>>>> ocfs2_inode_lock_with_page() when reading a page, we should not
>>>> return directly here, since this will lead to a softlockup problem.
>>>> The method is to get a blocking lock and immediately unlock before
>>>> returning, this can avoid CPU resource waste due to lots of retries,
>>>> and benefits fairness in getting lock among multiple nodes, increase
>>>> efficiency in case modifying the same file frequently from multiple
>>>> nodes.
>>>> The softlockup problem looks like,
>>>> Kernel panic - not syncing: softlockup: hung tasks
>>>> CPU: 0 PID: 885 Comm: multi_mmap Tainted: G L 4.12.14-6.1-default #1
>>>> Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
>>>> Call Trace:
>>>>   <IRQ>
>>>>   dump_stack+0x5c/0x82
>>>>   panic+0xd5/0x21e
>>>>   watchdog_timer_fn+0x208/0x210
>>>>   ? watchdog_park_threads+0x70/0x70
>>>>   __hrtimer_run_queues+0xcc/0x200
>>>>   hrtimer_interrupt+0xa6/0x1f0
>>>>   smp_apic_timer_interrupt+0x34/0x50
>>>>   apic_timer_interrupt+0x96/0xa0
>>>>   </IRQ>
>>>>  RIP: 0010:unlock_page+0x17/0x30
>>>>  RSP: 0000:ffffaf154080bc88 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
>>>>  RAX: dead000000000100 RBX: fffff21e009f5300 RCX: 0000000000000004
>>>>  RDX: dead0000000000ff RSI: 0000000000000202 RDI: fffff21e009f5300
>>>>  RBP: 0000000000000000 R08: 0000000000000000 R09: ffffaf154080bb00
>>>>  R10: ffffaf154080bc30 R11: 0000000000000040 R12: ffff993749a39518
>>>>  R13: 0000000000000000 R14: fffff21e009f5300 R15: fffff21e009f5300
>>>>   ocfs2_inode_lock_with_page+0x25/0x30 [ocfs2]
>>>>   ocfs2_readpage+0x41/0x2d0 [ocfs2]
>>>>   ? pagecache_get_page+0x30/0x200
>>>>   filemap_fault+0x12b/0x5c0
>>>>   ? recalc_sigpending+0x17/0x50
>>>>   ? __set_task_blocked+0x28/0x70
>>>>   ? __set_current_blocked+0x3d/0x60
>>>>   ocfs2_fault+0x29/0xb0 [ocfs2]
>>>>   __do_fault+0x1a/0xa0
>>>>   __handle_mm_fault+0xbe8/0x1090
>>>>   handle_mm_fault+0xaa/0x1f0
>>>>   __do_page_fault+0x235/0x4b0
>>>>   trace_do_page_fault+0x3c/0x110
>>>>   async_page_fault+0x28/0x30
>>>>  RIP: 0033:0x7fa75ded638e
>>>>  RSP: 002b:00007ffd6657db18 EFLAGS: 00010287
>>>>  RAX: 000055c7662fb700 RBX: 0000000000000001 RCX: 000055c7662fb700
>>>>  RDX: 0000000000001770 RSI: 00007fa75e909000 RDI: 000055c7662fb700
>>>>  RBP: 0000000000000003 R08: 000000000000000e R09: 0000000000000000
>>>>  R10: 0000000000000483 R11: 00007fa75ded61b0 R12: 00007fa75e90a770
>>>>  R13: 000000000000000e R14: 0000000000001770 R15: 0000000000000000
>>>>
>>>> Fixes: 1cce4df04f37 ("ocfs2: do not lock/unlock() inode DLM lock")
>>>> Signed-off-by: Gang He <ghe at suse.com>
>>>> ---
>>>>  fs/ocfs2/dlmglue.c | 9 +++++++++
>>>>  1 file changed, 9 insertions(+)
>>>>
>>>> diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c
>>>> index 4689940..5193218 100644
>>>> --- a/fs/ocfs2/dlmglue.c
>>>> +++ b/fs/ocfs2/dlmglue.c
>>>> @@ -2486,6 +2486,15 @@ int ocfs2_inode_lock_with_page(struct inode *inode,
>>>>  	ret = ocfs2_inode_lock_full(inode, ret_bh, ex, OCFS2_LOCK_NONBLOCK);
>>>>  	if (ret == -EAGAIN) {
>>>>  		unlock_page(page);
>>>> +		/*
>>>> +		 * If we can't get inode lock immediately, we should not return
>>>> +		 * directly here, since this will lead to a softlockup problem.
>>>> +		 * The method is to get a blocking lock and immediately unlock
>>>> +		 * before returning, this can avoid CPU resource waste due to
>>>> +		 * lots of retries, and benefits fairness in getting lock.
>>>> +		 */
>>>> +		if (ocfs2_inode_lock(inode, ret_bh, ex) == 0)
>>>> +			ocfs2_inode_unlock(inode, ex);
>>>>  		ret = AOP_TRUNCATED_PAGE;
>>>>  	}
>>>>  
>>>>
