On 11/10/20 1:17 AM, Wengang Wang wrote:
> Though the problem was found on a lower 4.1.12 kernel, I think upstream
> has the same issue.
>
> On one node in the cluster, there is the following call trace:
>
> # cat /proc/21473/stack
> [<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
> [<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
> [<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
> [<ffffffff8122b36e>] evict+0xae/0x1a0
> [<ffffffff8122bd26>] iput+0x1c6/0x230
> [<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
> [<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
> [<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
> [<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
> [<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
> [<ffffffff810a1399>] process_one_work+0x169/0x4a0
> [<ffffffff810a1bcb>] worker_thread+0x5b/0x560
> [<ffffffff810a7a2b>] kthread+0xcb/0xf0
> [<ffffffff816f5d21>] ret_from_fork+0x61/0x90
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The above stack is not reasonable; the final iput should not happen in
> ocfs2_orphan_filldir(). Looking at the code:
>
> 2067         /* Skip inodes which are already added to recover list, since dio may
> 2068          * happen concurrently with unlink/rename */
> 2069         if (OCFS2_I(iter)->ip_next_orphan) {
> 2070                 iput(iter);
> 2071                 return 0;
> 2072         }
> 2073
>
> The logic assumes the inode is already on the recovery list when it sees
> that ip_next_orphan is non-NULL, so it skips this inode after dropping the
> reference that was taken in ocfs2_iget().
>
> However, if the inode really were on the recovery list, it would hold
> another reference, and the iput() at line 2070 could not be the final iput
> (dropping the last reference). So I don't think the inode is really
> on the recovery list (no vmcore to confirm).
>
> Note that ocfs2_queue_orphans(), though not shown in the call trace,
> holds the cluster lock on the orphan directory while looking up
> unlinked inodes. Evicting the on-disk inode can involve a lot of IO,
> which may take a long time to finish. That means this node could hold
> the cluster lock for a very long time, which can make lock requests
> (from other nodes) on the orphan directory hang for a long time.
>
> Looking further at ip_next_orphan, I found it is not initialized when
> a new ocfs2_inode_info structure is allocated.
>
> Fix:
>     initialize ip_next_orphan as NULL.
>
> Signed-off-by: Wengang Wang <wen.gang.wang at oracle.com>

Reviewed-by: Joseph Qi <joseph.qi at linux.alibaba.com>

> ---
> v1 -> v2: move the initialization of ip_next_orphan earlier.
> ---
>  fs/ocfs2/super.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 1d91dd1e8711..2febc76e9de7 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *data)
>
>  	oi->ip_blkno = 0ULL;
>  	oi->ip_clusters = 0;
> +	oi->ip_next_orphan = NULL;
>
>  	ocfs2_resv_init_once(&oi->ip_la_data_resv);
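[Editor's note] The commit message above turns on one pattern: ip_next_orphan is both a list link and an "already on the recovery list" flag, so a field that the constructor never clears can hold garbage and make a brand-new inode look queued, causing the premature final iput. The following is a minimal, self-contained user-space sketch of that failure mode, not the ocfs2 code; all names here (fake_inode, init_once_buggy, filldir, etc.) are illustrative.

/*
 * Sketch of the bug: an embedded "next" pointer doubles as an
 * "already queued" flag.  If the constructor forgets to clear it,
 * a freshly set-up object can carry stale bytes and look queued.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct fake_inode {
	unsigned long blkno;
	struct fake_inode *next_orphan;	/* non-NULL means "already queued" */
};

/* Analogue of a slab constructor that forgets one field. */
static void init_once_buggy(struct fake_inode *oi)
{
	oi->blkno = 0;
	/* BUG: oi->next_orphan keeps whatever was in the memory before. */
}

/* Fixed constructor, mirroring the one-line patch. */
static void init_once_fixed(struct fake_inode *oi)
{
	oi->blkno = 0;
	oi->next_orphan = NULL;
}

/* Analogue of the check in ocfs2_orphan_filldir(). */
static void filldir(struct fake_inode *oi)
{
	if (oi->next_orphan) {
		printf("blkno %lu: looks already queued -> skipped "
		       "(premature final put in the real bug)\n", oi->blkno);
		return;
	}
	printf("blkno %lu: queued for orphan recovery\n", oi->blkno);
}

int main(void)
{
	/* Simulate reused memory that still contains old, non-zero bytes. */
	void *mem = malloc(sizeof(struct fake_inode));
	memset(mem, 0xa5, sizeof(struct fake_inode));
	struct fake_inode *oi = mem;

	init_once_buggy(oi);
	oi->blkno = 100;
	filldir(oi);		/* false "already queued" */

	init_once_fixed(oi);
	oi->blkno = 100;
	filldir(oi);		/* correctly queued */

	free(mem);
	return 0;
}

The design point the patch relies on: ocfs2_inode_init_once() is the slab constructor for ocfs2_inode_info, so every field the rest of the code tests against NULL has to be cleared there (or on every allocation); the sketch above only illustrates why a missed field is enough to trigger the observed trace.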
Wengang Wang
2020-Nov-10 01:56 UTC
[Ocfs2-devel] [PATCH V2] ocfs2: initialize ip_next_orphan
On 11/9/20 5:33 PM, Joseph Qi wrote:
>
> On 11/10/20 1:17 AM, Wengang Wang wrote:
>> Though the problem was found on a lower 4.1.12 kernel, I think upstream
>> has the same issue.
>>
>> On one node in the cluster, there is the following call trace:
>>
>> # cat /proc/21473/stack
>> [<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
>> [<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
>> [<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
>> [<ffffffff8122b36e>] evict+0xae/0x1a0
>> [<ffffffff8122bd26>] iput+0x1c6/0x230
>> [<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
>> [<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
>> [<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
>> [<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
>> [<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
>> [<ffffffff810a1399>] process_one_work+0x169/0x4a0
>> [<ffffffff810a1bcb>] worker_thread+0x5b/0x560
>> [<ffffffff810a7a2b>] kthread+0xcb/0xf0
>> [<ffffffff816f5d21>] ret_from_fork+0x61/0x90
>> [<ffffffffffffffff>] 0xffffffffffffffff
>>
>> The above stack is not reasonable; the final iput should not happen in
>> ocfs2_orphan_filldir(). Looking at the code:
>>
>> 2067         /* Skip inodes which are already added to recover list, since dio may
>> 2068          * happen concurrently with unlink/rename */
>> 2069         if (OCFS2_I(iter)->ip_next_orphan) {
>> 2070                 iput(iter);
>> 2071                 return 0;
>> 2072         }
>> 2073
>>
>> The logic assumes the inode is already on the recovery list when it sees
>> that ip_next_orphan is non-NULL, so it skips this inode after dropping the
>> reference that was taken in ocfs2_iget().
>>
>> However, if the inode really were on the recovery list, it would hold
>> another reference, and the iput() at line 2070 could not be the final iput
>> (dropping the last reference). So I don't think the inode is really
>> on the recovery list (no vmcore to confirm).
>>
>> Note that ocfs2_queue_orphans(), though not shown in the call trace,
>> holds the cluster lock on the orphan directory while looking up
>> unlinked inodes. Evicting the on-disk inode can involve a lot of IO,
>> which may take a long time to finish. That means this node could hold
>> the cluster lock for a very long time, which can make lock requests
>> (from other nodes) on the orphan directory hang for a long time.
>>
>> Looking further at ip_next_orphan, I found it is not initialized when
>> a new ocfs2_inode_info structure is allocated.
>>
>> Fix:
>>     initialize ip_next_orphan as NULL.
>>
>> Signed-off-by: Wengang Wang <wen.gang.wang at oracle.com>
>
> Reviewed-by: Joseph Qi <joseph.qi at linux.alibaba.com>

Thank you Joseph!

AKPM, could you please pull this patch to your tree?

thanks,
wengang

>> ---
>> v1 -> v2: move the initialization of ip_next_orphan earlier.
>> ---
>>  fs/ocfs2/super.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
>> index 1d91dd1e8711..2febc76e9de7 100644
>> --- a/fs/ocfs2/super.c
>> +++ b/fs/ocfs2/super.c
>> @@ -1713,6 +1713,7 @@ static void ocfs2_inode_init_once(void *data)
>>
>>  	oi->ip_blkno = 0ULL;
>>  	oi->ip_clusters = 0;
>> +	oi->ip_next_orphan = NULL;
>>
>>  	ocfs2_resv_init_once(&oi->ip_la_data_resv);
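[Editor's note] The quoted analysis also rests on a refcount argument: an inode that is genuinely on the recovery list holds an extra reference, so dropping the lookup reference in ocfs2_orphan_filldir() cannot be the last put; the eviction seen in the trace therefore points at a stale ip_next_orphan rather than a real list entry. A toy, self-contained sketch of that reasoning follows; it is not kernel code, and all names (get, put, the printed messages) are illustrative.

#include <stdio.h>

struct obj {
	int refcount;
};

static void get(struct obj *o)
{
	o->refcount++;
}

static void put(struct obj *o)
{
	if (--o->refcount == 0)
		printf("final put: object evicted (expensive IO in the ocfs2 case)\n");
	else
		printf("put: %d reference(s) remain, no eviction\n", o->refcount);
}

int main(void)
{
	/* Case 1: inode really on the recovery list: lookup ref + list ref. */
	struct obj inode1 = { .refcount = 0 };
	get(&inode1);	/* reference from the lookup (ocfs2_iget analogue) */
	get(&inode1);	/* reference held by the recovery list */
	put(&inode1);	/* dropping the lookup ref is NOT the final put */

	/* Case 2: inode not on any list, but the flag field holds garbage. */
	struct obj inode2 = { .refcount = 0 };
	get(&inode2);	/* only the lookup reference exists */
	put(&inode2);	/* this IS the final put: eviction under the cluster lock */

	return 0;
}

Case 2 is the situation the patch removes: with ip_next_orphan cleared in the constructor, a freshly looked-up inode can no longer be mistaken for a list member, so the expensive eviction no longer runs while the orphan directory's cluster lock is held.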