On 11/1/20 5:40 PM, Joseph Qi wrote:
> On 2020/10/30 23:32, Wengang Wang wrote:
>> Thanks for review Joseph,
>>
>> Please see in lines:
>>
>> On 10/29/20 10:55 PM, Joseph Qi wrote:
>>> On 2020/10/30 05:04, Wengang Wang wrote:
>>>> Though the problem was found on an older 4.1.12 kernel, I think
>>>> upstream has the same issue.
>>>>
>>>> On one node in the cluster, there is the following call trace:
>>>>
>>>> # cat /proc/21473/stack
>>>> [<ffffffffc09a2f06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
>>>> [<ffffffffc09a4481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
>>>> [<ffffffffc09b2ce2>] ocfs2_evict_inode+0x152/0x820 [ocfs2]
>>>> [<ffffffff8122b36e>] evict+0xae/0x1a0
>>>> [<ffffffff8122bd26>] iput+0x1c6/0x230
>>>> [<ffffffffc09b60ed>] ocfs2_orphan_filldir+0x5d/0x100 [ocfs2]
>>>> [<ffffffffc0992ae0>] ocfs2_dir_foreach_blk+0x490/0x4f0 [ocfs2]
>>>> [<ffffffffc099a1e9>] ocfs2_dir_foreach+0x29/0x30 [ocfs2]
>>>> [<ffffffffc09b7716>] ocfs2_recover_orphans+0x1b6/0x9a0 [ocfs2]
>>>> [<ffffffffc09b9b4e>] ocfs2_complete_recovery+0x1de/0x5c0 [ocfs2]
>>>> [<ffffffff810a1399>] process_one_work+0x169/0x4a0
>>>> [<ffffffff810a1bcb>] worker_thread+0x5b/0x560
>>>> [<ffffffff810a7a2b>] kthread+0xcb/0xf0
>>>> [<ffffffff816f5d21>] ret_from_fork+0x61/0x90
>>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>>
>>>> The above stack is not reasonable; the final iput should not happen
>>>> inside ocfs2_orphan_filldir(). Looking at the code,
>>>>
>>>> 2067         /* Skip inodes which are already added to recover list, since dio may
>>>> 2068          * happen concurrently with unlink/rename */
>>>> 2069         if (OCFS2_I(iter)->ip_next_orphan) {
>>>> 2070                 iput(iter);
>>>> 2071                 return 0;
>>>> 2072         }
>>>> 2073
>>>>
>>>> The logic assumes the inode is already on the recover list when it sees
>>>> a non-NULL ip_next_orphan, so it skips this inode after dropping the
>>>> reference that was taken in ocfs2_iget().
>>>>
>>>> But if the inode really were on the recover list, it would hold another
>>>> reference, and the iput() at line 2070 would not be the final iput
>>>> (dropping the last reference). So I don't think the inode is really on
>>>> the recover list (no vmcore to confirm).
>>>>
>>>> Note that ocfs2_queue_orphans(), though it does not show up in the call
>>>> trace, holds the cluster lock on the orphan directory while looking up
>>>> unlinked inodes. Evicting the on-disk inode can involve a lot of IO and
>>>> may take a long time to finish. That means this node could hold the
>>>> cluster lock for a very long time, which can leave lock requests (from
>>>> other nodes) against the orphan directory hanging for a long time.
>>>>
>>>> Looking further at ip_next_orphan, I found it is not initialized when a
>>>> new ocfs2_inode_info structure is allocated.
>>> I don't see the internal relations.
>> If not initialized, ip_next_orphan could be any value. When it is an
>> arbitrary non-NULL value rather than zero (NULL), the problem appears
>> (at lines 2069 and 2070).
>>
>> But what I am curious about is why this problem didn't show up much
>> earlier. Hope I can find the answer here.
>>
>>> And AFAIK, ip_next_orphan will be initialized during
>>> ocfs2_queue_orphans().
>> I am not seeing it initialized in ocfs2_queue_orphans() in the v5.10-rc1
>> source. Can you provide more details on where it is initialized?
>>
> I thought it was initialized by ocfs2_queue_orphans() ->
> ocfs2_orphan_filldir(). But taking a closer look at the code, it is after
> the check you pasted above, so you are right.
Thanks for checking again.
> I also have the same question now, why we don't encounter it before
> since recovery is a very common case for us.
One guess is that the problem was actually happening in the past too, but just
went unnoticed somehow.
The impact of this problem is holding the inode cluster lock on the orphan
directory in question for a long time; as long as no other node requests the
same lock, that is OK.
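
Coming back to ip_next_orphan for a moment: from my reading of v5.10-rc1, the
field only gets a meaningful value further down in ocfs2_orphan_filldir(),
after the check quoted above, roughly like this (paraphrased sketch, the
private pointer name may not be exact):

    /* later in ocfs2_orphan_filldir(), after the skip check: link the
     * inode into the list that ocfs2_queue_orphans() is building */
    OCFS2_I(iter)->ip_next_orphan = p->head;
    p->head = iter;

So a stale non-NULL value in an uninitialized field is enough to make the
check at line 2069 fire and drop what is in fact the last reference.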
How I noticed this problem is that there were reflink operations happening on
different nodes. When reflinks are created, they are first created under the
orphan directory (step 1) and then moved to the destination directory (step 2).
Both steps take (and then release) the inode cluster lock on the orphan
directory. My case is in step 2, moving the reflink from the orphan directory
to the destination, with the following stack trace:
[<ffffffffc099bf06>] __ocfs2_cluster_lock.isra.36+0x336/0x9e0 [ocfs2]
[<ffffffffc099d481>] ocfs2_inode_lock_full_nested+0x121/0x520 [ocfs2]
[<ffffffffc09c3033>] ocfs2_mv_orphaned_inode_to_new+0x383/0xaa0 [ocfs2]
[<ffffffffc09d0c44>] ocfs2_vfs_reflink+0x5e4/0x850 [ocfs2]
[<ffffffffc09d1137>] ocfs2_reflink_ioctl+0x287/0x290 [ocfs2]
[<ffffffffc09ae410>] ocfs2_ioctl+0x280/0x750 [ocfs2]
[<ffffffff8122310b>] do_vfs_ioctl+0x2fb/0x510
Those reflink operations were found hanging for a very long time.
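
One straightforward way to avoid the stale value would be to initialize the
field when the ocfs2 inode object is constructed. A minimal sketch (untested,
and assuming ocfs2_inode_init_once() in fs/ocfs2/super.c is the slab
constructor used for ocfs2 inodes):

    static void ocfs2_inode_init_once(void *data)
    {
            struct ocfs2_inode_info *oi = data;

            /* ... existing initialization of the other ip_* fields ... */

            /* Make sure ip_next_orphan starts out NULL so that
             * ocfs2_orphan_filldir() cannot mistake an uninitialized
             * value for "already queued for recovery". */
            oi->ip_next_orphan = NULL;

            inode_init_once(&oi->vfs_inode);
    }

With that in place, a non-NULL ip_next_orphan can only mean the inode was
really linked onto the recovery list.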
Let's wait for some more ideas.
thanks,
wengang