Heming Zhao
2023-Apr-30  03:13 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
fstest generic cases 266 272 281 trigger hanging issue when umount.
I use 266 to describe the root cause.
```
 49 _dmerror_unmount
 50 _dmerror_mount
 51
 52 echo "Compare files"
 53 md5sum $testdir/file1 | _filter_scratch
 54 md5sum $testdir/file2 | _filter_scratch
 55
 56 echo "CoW and unmount"
 57 sync
 58 _dmerror_load_error_table
 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize"
\
 60     -c "fdatasync" $testdir/file2 2>&1)
 61 echo $urk >> $seqres.full
 62 echo "$urk" | grep -q "error" || _fail "pwrite did
not fail"
 63
 64 echo "Clean up the mess"
 65 _dmerror_unmount
```
After line 49 50 umount & mount ocfs2 dev, this case run md5sum to
verify target file. Line 57 run 'sync' before line 58 changes the dm
target from dm-linear to dm-error. This case is hanging at line 65.
The md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to
do journal job. But there is only ->j_num_trans+1 in ocfs2_start_trans,
the ocfs2_commit_trans doesn't do reduction operation, 'sync'
neither.
finally no function reset ->j_num_trans until umount is triggered.
call flow:
```
[md5sum] //line 53 54
 vfs_read
  ocfs2_file_read_iter
   ocfs2_inode_lock_atime
    ocfs2_update_inode_atime
     + ocfs2_start_trans //atomic_inc j_num_trans
     + ...
     + ocfs2_commit_trans//no modify j_num_trans
sync //line 57. no modify j_num_trans
_dmerror_load_error_table //all write will return error after this line
_dmerror_unmount //found j_num_trans is not zero, run commit thread
               //but the underlying dev is dm-error, journaling IO
               //failed all the time and keep going to retry.
```
*** How to fix ***
kick commit thread in sync path, which can reset j_num_trans to 0.
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/super.c | 3 +++
 1 file changed, 3 insertions(+)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0b0e6a132101..bb3fa21e9b47 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
 			jbd2_log_wait_commit(osb->journal->j_journal,
 					     target);
 	}
+	/* kick commit thread to reset journal->j_num_trans */
+	if (atomic_read(&(osb->journal->j_num_trans)))
+		wake_up(&osb->checkpoint_event);
 	return 0;
 }
 
-- 
2.39.2
Heming Zhao
2023-Apr-30  03:13 UTC
[Ocfs2-devel] [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
fstest generic cases 347 361 628 629 trigger a same issue:
When jbd2 enter ABORT status, ocfs2 ignores it and keep going to commit
journal.
This commit gives ocfs2 ability to handle jbd2 ABORT case.
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/alloc.c      | 10 ++++++----
 fs/ocfs2/journal.c    |  5 +++++
 fs/ocfs2/localalloc.c |  3 +++
 3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 51c93929a146..d90961a1c433 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6308,11 +6308,13 @@ void ocfs2_truncate_log_shutdown(struct ocfs2_super
*osb)
 
 	if (tl_inode) {
 		cancel_delayed_work(&osb->osb_truncate_log_wq);
-		flush_workqueue(osb->ocfs2_wq);
+		if (!is_journal_aborted(osb->journal->j_journal)) {
+			flush_workqueue(osb->ocfs2_wq);
 
-		status = ocfs2_flush_truncate_log(osb);
-		if (status < 0)
-			mlog_errno(status);
+			status = ocfs2_flush_truncate_log(osb);
+			if (status < 0)
+				mlog_errno(status);
+		}
 
 		brelse(osb->osb_tl_bh);
 		iput(osb->osb_tl_inode);
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 25d8072ccfce..2798067a2f66 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -312,11 +312,16 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb)
 	status = jbd2_journal_flush(journal->j_journal, 0);
 	jbd2_journal_unlock_updates(journal->j_journal);
 	if (status < 0) {
+		if (is_journal_aborted(journal->j_journal)) {
+			ocfs2_error(osb->sb, "jbd2 status: ABORT.\n");
+			goto reset;
+		}
 		up_write(&journal->j_trans_barrier);
 		mlog_errno(status);
 		goto finally;
 	}
 
+reset:
 	ocfs2_inc_trans_id(journal);
 
 	flushed = atomic_read(&journal->j_num_trans);
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c4426d12a2ad..e2e3400717b0 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -378,6 +378,9 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
 	if (osb->ocfs2_wq)
 		flush_workqueue(osb->ocfs2_wq);
 
+	if (is_journal_aborted(osb->journal->j_journal))
+		goto out;
+
 	if (osb->local_alloc_state == OCFS2_LA_UNUSED)
 		goto out;
 
-- 
2.39.2
Joseph Qi
2023-May-01  02:07 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
Hi, What's the journal status in this case? I wonder why commit thread is not working, which should flush journal and reset j_num_trans during commit cache. Thanks, Joseph On 4/30/23 11:13 AM, Heming Zhao wrote:> fstest generic cases 266 272 281 trigger hanging issue when umount. > > I use 266 to describe the root cause. > > ``` > 49 _dmerror_unmount > 50 _dmerror_mount > 51 > 52 echo "Compare files" > 53 md5sum $testdir/file1 | _filter_scratch > 54 md5sum $testdir/file2 | _filter_scratch > 55 > 56 echo "CoW and unmount" > 57 sync > 58 _dmerror_load_error_table > 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ > 60 -c "fdatasync" $testdir/file2 2>&1) > 61 echo $urk >> $seqres.full > 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" > 63 > 64 echo "Clean up the mess" > 65 _dmerror_unmount > ``` > > After line 49 50 umount & mount ocfs2 dev, this case run md5sum to > verify target file. Line 57 run 'sync' before line 58 changes the dm > target from dm-linear to dm-error. This case is hanging at line 65. > > The md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to > do journal job. But there is only ->j_num_trans+1 in ocfs2_start_trans, > the ocfs2_commit_trans doesn't do reduction operation, 'sync' neither. > finally no function reset ->j_num_trans until umount is triggered. > > call flow: > ``` > [md5sum] //line 53 54 > vfs_read > ocfs2_file_read_iter > ocfs2_inode_lock_atime > ocfs2_update_inode_atime > + ocfs2_start_trans //atomic_inc j_num_trans > + ... > + ocfs2_commit_trans//no modify j_num_trans > > sync //line 57. no modify j_num_trans > > _dmerror_load_error_table //all write will return error after this line > > _dmerror_unmount //found j_num_trans is not zero, run commit thread > //but the underlying dev is dm-error, journaling IO > //failed all the time and keep going to retry. > ``` > > *** How to fix *** > > kick commit thread in sync path, which can reset j_num_trans to 0. > > Signed-off-by: Heming Zhao <heming.zhao at suse.com> > --- > fs/ocfs2/super.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c > index 0b0e6a132101..bb3fa21e9b47 100644 > --- a/fs/ocfs2/super.c > +++ b/fs/ocfs2/super.c > @@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) > jbd2_log_wait_commit(osb->journal->j_journal, > target); > } > + /* kick commit thread to reset journal->j_num_trans */ > + if (atomic_read(&(osb->journal->j_num_trans))) > + wake_up(&osb->checkpoint_event); > return 0; > } >
Maybe Matching Threads
- [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
- [PATCH] ocfs2: fix missing reset j_num_trans for sync
- [PATCH] ocfs2: fix missing reset j_num_trans for sync
- [PATCH 0/3] ocfs2: Switch over to JBD2.
- [PATCH] ocfs2: flush dentry lock drop when sync ocfs2 volume.