Heming Zhao
2023-Apr-30 03:13 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
fstests generic cases 266, 272 and 281 hang at umount.

I use 266 to describe the root cause.

```
49 _dmerror_unmount
50 _dmerror_mount
51
52 echo "Compare files"
53 md5sum $testdir/file1 | _filter_scratch
54 md5sum $testdir/file2 | _filter_scratch
55
56 echo "CoW and unmount"
57 sync
58 _dmerror_load_error_table
59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \
60     -c "fdatasync" $testdir/file2 2>&1)
61 echo $urk >> $seqres.full
62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail"
63
64 echo "Clean up the mess"
65 _dmerror_unmount
```

After lines 49 and 50 umount and mount the ocfs2 dev, this case runs
md5sum to verify the target files. Line 57 runs 'sync' before line 58
changes the dm target from dm-linear to dm-error. The case hangs at
line 65.

md5sum calls the jbd2 trans pair ocfs2_[start|commit]_trans to do the
journal job. But ocfs2_start_trans only does ->j_num_trans+1;
ocfs2_commit_trans does not decrement it, and neither does 'sync'.
As a result, nothing resets ->j_num_trans until umount is triggered.

call flow:
```
[md5sum] //line 53 54
 vfs_read
  ocfs2_file_read_iter
   ocfs2_inode_lock_atime
    ocfs2_update_inode_atime
     + ocfs2_start_trans //atomic_inc j_num_trans
     + ...
     + ocfs2_commit_trans//no modify j_num_trans

sync //line 57. no modify j_num_trans

_dmerror_load_error_table //all write will return error after this line

_dmerror_unmount //found j_num_trans is not zero, run commit thread
                 //but the underlying dev is dm-error, journaling IO
                 //failed all the time and keep going to retry.
```

*** How to fix ***

Kick the commit thread in the sync path, which can reset j_num_trans to 0.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/super.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0b0e6a132101..bb3fa21e9b47 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 			jbd2_log_wait_commit(osb->journal->j_journal,
 					     target);
 	}
+	/* kick commit thread to reset journal->j_num_trans */
+	if (atomic_read(&(osb->journal->j_num_trans)))
+		wake_up(&osb->checkpoint_event);
 	return 0;
 }

-- 
2.39.2
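
The counter behaviour described in this patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: every `sim_*` name is hypothetical, each md5sum run is assumed to produce exactly one atime-update transaction, and the commit thread is modelled as a synchronous call, whereas the patch only wakes it up via `checkpoint_event`.

```c
#include <assert.h>

/* Toy user-space model of the counter; none of this is kernel code. */
struct sim_journal {
	int num_trans;   /* stands in for journal->j_num_trans */
	int dev_failed;  /* nonzero once the dm-error table is loaded */
};

/* Like ocfs2_start_trans: every started transaction bumps the counter. */
static void sim_start_trans(struct sim_journal *j)
{
	j->num_trans++;
}

/* Like ocfs2_commit_trans: the counter is left untouched. */
static void sim_commit_trans(struct sim_journal *j)
{
	(void)j;
}

/*
 * Stand-in for the commit thread's work. Returns 0 and resets the
 * counter when the device accepts writes, -1 (counter unchanged)
 * when it does not.
 */
static int sim_commit_cache(struct sim_journal *j)
{
	if (j->dev_failed)
		return -1;
	j->num_trans = 0;
	return 0;
}

/* Sync path; kick != 0 models the patched ocfs2_sync_fs. */
static void sim_sync(struct sim_journal *j, int kick)
{
	if (kick && j->num_trans)
		(void)sim_commit_cache(j);
}

/*
 * Replays lines 53-58 of generic/266 and returns how many
 * transactions are still counted when umount starts.
 */
int sim_pending_at_umount(int kick_on_sync)
{
	struct sim_journal j = { 0, 0 };
	int i;

	for (i = 0; i < 2; i++) {       /* the two md5sum runs */
		sim_start_trans(&j);
		sim_commit_trans(&j);
	}
	sim_sync(&j, kick_on_sync);     /* line 57 */
	j.dev_failed = 1;               /* line 58: dm-error loaded */
	assert(j.num_trans >= 0);
	return j.num_trans;
}
```

In this model, `sim_pending_at_umount(0)` leaves a nonzero count for umount to deal with on a device that now rejects every write, while `sim_pending_at_umount(1)` reaches umount with the count already cleared.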
Heming Zhao
2023-Apr-30 03:13 UTC
[Ocfs2-devel] [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
fstests generic cases 347, 361, 628 and 629 trigger the same issue:
when jbd2 enters ABORT status, ocfs2 ignores it and keeps trying to
commit the journal.

This commit gives ocfs2 the ability to handle the jbd2 ABORT case.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/alloc.c      | 10 ++++++----
 fs/ocfs2/journal.c    |  5 +++++
 fs/ocfs2/localalloc.c |  3 +++
 3 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 51c93929a146..d90961a1c433 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6308,11 +6308,13 @@ void ocfs2_truncate_log_shutdown(struct ocfs2_super *osb)

 	if (tl_inode) {
 		cancel_delayed_work(&osb->osb_truncate_log_wq);
-		flush_workqueue(osb->ocfs2_wq);
+		if (!is_journal_aborted(osb->journal->j_journal)) {
+			flush_workqueue(osb->ocfs2_wq);

-		status = ocfs2_flush_truncate_log(osb);
-		if (status < 0)
-			mlog_errno(status);
+			status = ocfs2_flush_truncate_log(osb);
+			if (status < 0)
+				mlog_errno(status);
+		}

 		brelse(osb->osb_tl_bh);
 		iput(osb->osb_tl_inode);
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 25d8072ccfce..2798067a2f66 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -312,11 +312,16 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb)
 	status = jbd2_journal_flush(journal->j_journal, 0);
 	jbd2_journal_unlock_updates(journal->j_journal);
 	if (status < 0) {
+		if (is_journal_aborted(journal->j_journal)) {
+			ocfs2_error(osb->sb, "jbd2 status: ABORT.\n");
+			goto reset;
+		}
 		up_write(&journal->j_trans_barrier);
 		mlog_errno(status);
 		goto finally;
 	}

+reset:
 	ocfs2_inc_trans_id(journal);

 	flushed = atomic_read(&journal->j_num_trans);
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c4426d12a2ad..e2e3400717b0 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -378,6 +378,9 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
 	if (osb->ocfs2_wq)
 		flush_workqueue(osb->ocfs2_wq);

+	if (is_journal_aborted(osb->journal->j_journal))
+		goto out;
+
 	if (osb->local_alloc_state == OCFS2_LA_UNUSED)
 		goto out;

-- 
2.39.2
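
The control flow that the journal.c hunk adds to ocfs2_commit_cache can be sketched in user space as follows. This is a rough model under assumptions, not the kernel implementation: all `toy_*` names are hypothetical, the failing flush is assumed to return -EIO, and the code after the `reset` label is assumed to clear the transaction counter and leave `status` unchanged (the hunk shown in the patch ends before that point).

```c
#include <errno.h>

/* Toy model of the journal state consulted by the commit path. */
struct toy_journal {
	int aborted;            /* models is_journal_aborted() */
	int flush_fails;        /* transient flush failure, no abort */
	int num_trans;          /* models journal->j_num_trans */
	unsigned long trans_id; /* models the id bumped by ocfs2_inc_trans_id */
};

/* Stand-in for jbd2_journal_flush(); -EIO is an assumed error code. */
static int toy_flush(const struct toy_journal *j)
{
	return (j->aborted || j->flush_fails) ? -EIO : 0;
}

/*
 * Models the patched error path: a failed flush on an aborted journal
 * falls through to the reset step instead of returning early, so the
 * caller no longer sees a nonzero counter and retries forever.
 */
int toy_commit_cache(struct toy_journal *j)
{
	int status = toy_flush(j);

	if (status < 0 && !j->aborted)
		return status;  /* unchanged path: keep the counter */

	/* Success, or aborted journal: the "reset" label in the patch. */
	j->trans_id++;
	j->num_trans = 0;       /* assumed to follow the reset label */
	return status;
}
```

With an aborted journal the function still reports the error but clears the counter; with a failure that did not abort the journal the counter is preserved, matching the pre-patch behaviour.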
Joseph Qi
2023-May-01 02:07 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
Hi,

What's the journal status in this case?
I wonder why commit thread is not working, which should flush journal
and reset j_num_trans during commit cache.

Thanks,
Joseph

On 4/30/23 11:13 AM, Heming Zhao wrote:
> fstest generic cases 266 272 281 trigger hanging issue when umount.
>
> I use 266 to describe the root cause.
>
> ```
> 49 _dmerror_unmount
> 50 _dmerror_mount
> 51
> 52 echo "Compare files"
> 53 md5sum $testdir/file1 | _filter_scratch
> 54 md5sum $testdir/file2 | _filter_scratch
> 55
> 56 echo "CoW and unmount"
> 57 sync
> 58 _dmerror_load_error_table
> 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \
> 60     -c "fdatasync" $testdir/file2 2>&1)
> 61 echo $urk >> $seqres.full
> 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail"
> 63
> 64 echo "Clean up the mess"
> 65 _dmerror_unmount
> ```
>
> After line 49 50 umount & mount ocfs2 dev, this case run md5sum to
> verify target file. Line 57 run 'sync' before line 58 changes the dm
> target from dm-linear to dm-error. This case is hanging at line 65.
>
> The md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to
> do journal job. But there is only ->j_num_trans+1 in ocfs2_start_trans,
> the ocfs2_commit_trans doesn't do reduction operation, 'sync' neither.
> finally no function reset ->j_num_trans until umount is triggered.
>
> call flow:
> ```
> [md5sum] //line 53 54
>  vfs_read
>   ocfs2_file_read_iter
>    ocfs2_inode_lock_atime
>     ocfs2_update_inode_atime
>      + ocfs2_start_trans //atomic_inc j_num_trans
>      + ...
>      + ocfs2_commit_trans//no modify j_num_trans
>
> sync //line 57. no modify j_num_trans
>
> _dmerror_load_error_table //all write will return error after this line
>
> _dmerror_unmount //found j_num_trans is not zero, run commit thread
>                  //but the underlying dev is dm-error, journaling IO
>                  //failed all the time and keep going to retry.
> ```
>
> *** How to fix ***
>
> kick commit thread in sync path, which can reset j_num_trans to 0.
>
> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
>  fs/ocfs2/super.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 0b0e6a132101..bb3fa21e9b47 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
>  			jbd2_log_wait_commit(osb->journal->j_journal,
>  					     target);
>  	}
> +	/* kick commit thread to reset journal->j_num_trans */
> +	if (atomic_read(&(osb->journal->j_num_trans)))
> +		wake_up(&osb->checkpoint_event);
>  	return 0;
>  }
>