Heming Zhao
2023-Apr-21 08:36 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix missing reset j_num_trans for sync
fstest generic case 266 272 281 trigger hanging issue when umount. I use 266 to describe the root cause. ``` 49 _dmerror_unmount 50 _dmerror_mount 51 52 echo "Compare files" 53 md5sum $testdir/file1 | _filter_scratch 54 md5sum $testdir/file2 | _filter_scratch 55 56 echo "CoW and unmount" 57 sync 58 _dmerror_load_error_table 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ 60 -c "fdatasync" $testdir/file2 2>&1) 61 echo $urk >> $seqres.full 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" 63 64 echo "Clean up the mess" 65 _dmerror_unmount ``` After line 49 50 umount & mount ocfs2 dev, this case run md5sum to verify target file. line 57 run 'sync' before line 58 changes the dm target from dm-linear to dm-error. this case is hanging at line 65. In 266, md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to do journal job. but there is only ->j_num_trans+1 in ocfs2_start_trans, the ocfs2_commit_trans doesn't do reduction operation. call flow: ``` [md5sum] vfs_read ocfs2_file_read_iter ocfs2_inode_lock_atime ocfs2_update_inode_atime + ocfs2_start_trans //atomic_inc j_num_trans + ... + ocfs2_commit_trans//no modify j_num_trans sync //no modify j_num_trans _dmerror_load_error_table //all write will return error after this line _dmerror_unmount //found j_num_trans is not zero, run commit thread //but the underlying dev is dm-error, journaling IO //failed all the time and keep going to retry. ``` *** How to fix *** kick commit thread in sync path, which can reset j_num_trans to 0. Signed-off-by: Heming Zhao <heming.zhao at suse.com> --- fs/ocfs2/super.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c index 0b0e6a132101..9929320bd967 100644 --- a/fs/ocfs2/super.c +++ b/fs/ocfs2/super.c @@ -408,6 +408,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) if (jbd2_journal_start_commit(osb->journal->j_journal, &target)) { + /* kick commit thread to reset journal->j_num_trans */ + wake_up(&osb->checkpoint_event); + if (wait) jbd2_log_wait_commit(osb->journal->j_journal, target); -- 2.39.2
Heming Zhao
2023-Apr-22 08:55 UTC
[Ocfs2-devel] [PATCH] ocfs2: fix missing reset j_num_trans for sync
Sorry, please pause this patch review. When I was investigating fstest generic failed case 347 361, I found the wake_up() action should move out the 'if()' area. The correct way is calling wake_up() unconditionally. Thanks, Heming On 4/21/23 4:36 PM, Heming Zhao wrote:> fstest generic case 266 272 281 trigger hanging issue when umount. > > I use 266 to describe the root cause. > > ``` > 49 _dmerror_unmount > 50 _dmerror_mount > 51 > 52 echo "Compare files" > 53 md5sum $testdir/file1 | _filter_scratch > 54 md5sum $testdir/file2 | _filter_scratch > 55 > 56 echo "CoW and unmount" > 57 sync > 58 _dmerror_load_error_table > 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ > 60 -c "fdatasync" $testdir/file2 2>&1) > 61 echo $urk >> $seqres.full > 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" > 63 > 64 echo "Clean up the mess" > 65 _dmerror_unmount > ``` > > After line 49 50 umount & mount ocfs2 dev, this case run md5sum to > verify target file. line 57 run 'sync' before line 58 changes the dm > target from dm-linear to dm-error. this case is hanging at line 65. > > In 266, md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to > do journal job. but there is only ->j_num_trans+1 in ocfs2_start_trans, > the ocfs2_commit_trans doesn't do reduction operation. > > call flow: > ``` > [md5sum] > vfs_read > ocfs2_file_read_iter > ocfs2_inode_lock_atime > ocfs2_update_inode_atime > + ocfs2_start_trans //atomic_inc j_num_trans > + ... > + ocfs2_commit_trans//no modify j_num_trans > > sync //no modify j_num_trans > > _dmerror_load_error_table //all write will return error after this line > > _dmerror_unmount //found j_num_trans is not zero, run commit thread > //but the underlying dev is dm-error, journaling IO > //failed all the time and keep going to retry. > ``` > > *** How to fix *** > > kick commit thread in sync path, which can reset j_num_trans to 0. > > Signed-off-by: Heming Zhao <heming.zhao at suse.com> > --- > fs/ocfs2/super.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c > index 0b0e6a132101..9929320bd967 100644 > --- a/fs/ocfs2/super.c > +++ b/fs/ocfs2/super.c > @@ -408,6 +408,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) > > if (jbd2_journal_start_commit(osb->journal->j_journal, > &target)) { > + /* kick commit thread to reset journal->j_num_trans */ > + wake_up(&osb->checkpoint_event); > + > if (wait) > jbd2_log_wait_commit(osb->journal->j_journal, > target); >
Apparently Analagous Threads
- [PATCH] ocfs2: fix missing reset j_num_trans for sync
- [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
- [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
- [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
- [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status