Heming Zhao
2023-Apr-30 03:13 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
fstest generic cases 266 272 281 trigger hanging issue when umount.
I use 266 to describe the root cause.
```
49 _dmerror_unmount
50 _dmerror_mount
51
52 echo "Compare files"
53 md5sum $testdir/file1 | _filter_scratch
54 md5sum $testdir/file2 | _filter_scratch
55
56 echo "CoW and unmount"
57 sync
58 _dmerror_load_error_table
59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize"
\
60 -c "fdatasync" $testdir/file2 2>&1)
61 echo $urk >> $seqres.full
62 echo "$urk" | grep -q "error" || _fail "pwrite did
not fail"
63
64 echo "Clean up the mess"
65 _dmerror_unmount
```
After line 49 50 umount & mount ocfs2 dev, this case run md5sum to
verify target file. Line 57 run 'sync' before line 58 changes the dm
target from dm-linear to dm-error. This case is hanging at line 65.
The md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to
do journal job. But there is only ->j_num_trans+1 in ocfs2_start_trans,
the ocfs2_commit_trans doesn't do reduction operation, 'sync'
neither.
finally no function reset ->j_num_trans until umount is triggered.
call flow:
```
[md5sum] //line 53 54
vfs_read
ocfs2_file_read_iter
ocfs2_inode_lock_atime
ocfs2_update_inode_atime
+ ocfs2_start_trans //atomic_inc j_num_trans
+ ...
+ ocfs2_commit_trans//no modify j_num_trans
sync //line 57. no modify j_num_trans
_dmerror_load_error_table //all write will return error after this line
_dmerror_unmount //found j_num_trans is not zero, run commit thread
//but the underlying dev is dm-error, journaling IO
//failed all the time and keep going to retry.
```
*** How to fix ***
kick commit thread in sync path, which can reset j_num_trans to 0.
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/super.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
index 0b0e6a132101..bb3fa21e9b47 100644
--- a/fs/ocfs2/super.c
+++ b/fs/ocfs2/super.c
@@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
jbd2_log_wait_commit(osb->journal->j_journal,
target);
}
+ /* kick commit thread to reset journal->j_num_trans */
+ if (atomic_read(&(osb->journal->j_num_trans)))
+ wake_up(&osb->checkpoint_event);
return 0;
}
--
2.39.2
Heming Zhao
2023-Apr-30 03:13 UTC
[Ocfs2-devel] [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
fstest generic cases 347 361 628 629 trigger a same issue:
When jbd2 enter ABORT status, ocfs2 ignores it and keep going to commit
journal.
This commit gives ocfs2 ability to handle jbd2 ABORT case.
Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
fs/ocfs2/alloc.c | 10 ++++++----
fs/ocfs2/journal.c | 5 +++++
fs/ocfs2/localalloc.c | 3 +++
3 files changed, 14 insertions(+), 4 deletions(-)
diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 51c93929a146..d90961a1c433 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -6308,11 +6308,13 @@ void ocfs2_truncate_log_shutdown(struct ocfs2_super
*osb)
if (tl_inode) {
cancel_delayed_work(&osb->osb_truncate_log_wq);
- flush_workqueue(osb->ocfs2_wq);
+ if (!is_journal_aborted(osb->journal->j_journal)) {
+ flush_workqueue(osb->ocfs2_wq);
- status = ocfs2_flush_truncate_log(osb);
- if (status < 0)
- mlog_errno(status);
+ status = ocfs2_flush_truncate_log(osb);
+ if (status < 0)
+ mlog_errno(status);
+ }
brelse(osb->osb_tl_bh);
iput(osb->osb_tl_inode);
diff --git a/fs/ocfs2/journal.c b/fs/ocfs2/journal.c
index 25d8072ccfce..2798067a2f66 100644
--- a/fs/ocfs2/journal.c
+++ b/fs/ocfs2/journal.c
@@ -312,11 +312,16 @@ static int ocfs2_commit_cache(struct ocfs2_super *osb)
status = jbd2_journal_flush(journal->j_journal, 0);
jbd2_journal_unlock_updates(journal->j_journal);
if (status < 0) {
+ if (is_journal_aborted(journal->j_journal)) {
+ ocfs2_error(osb->sb, "jbd2 status: ABORT.\n");
+ goto reset;
+ }
up_write(&journal->j_trans_barrier);
mlog_errno(status);
goto finally;
}
+reset:
ocfs2_inc_trans_id(journal);
flushed = atomic_read(&journal->j_num_trans);
diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c4426d12a2ad..e2e3400717b0 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -378,6 +378,9 @@ void ocfs2_shutdown_local_alloc(struct ocfs2_super *osb)
if (osb->ocfs2_wq)
flush_workqueue(osb->ocfs2_wq);
+ if (is_journal_aborted(osb->journal->j_journal))
+ goto out;
+
if (osb->local_alloc_state == OCFS2_LA_UNUSED)
goto out;
--
2.39.2
Joseph Qi
2023-May-01 02:07 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
Hi, What's the journal status in this case? I wonder why commit thread is not working, which should flush journal and reset j_num_trans during commit cache. Thanks, Joseph On 4/30/23 11:13 AM, Heming Zhao wrote:> fstest generic cases 266 272 281 trigger hanging issue when umount. > > I use 266 to describe the root cause. > > ``` > 49 _dmerror_unmount > 50 _dmerror_mount > 51 > 52 echo "Compare files" > 53 md5sum $testdir/file1 | _filter_scratch > 54 md5sum $testdir/file2 | _filter_scratch > 55 > 56 echo "CoW and unmount" > 57 sync > 58 _dmerror_load_error_table > 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \ > 60 -c "fdatasync" $testdir/file2 2>&1) > 61 echo $urk >> $seqres.full > 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail" > 63 > 64 echo "Clean up the mess" > 65 _dmerror_unmount > ``` > > After line 49 50 umount & mount ocfs2 dev, this case run md5sum to > verify target file. Line 57 run 'sync' before line 58 changes the dm > target from dm-linear to dm-error. This case is hanging at line 65. > > The md5sum calls jbd2 trans pair: ocfs2_[start|commit]_trans to > do journal job. But there is only ->j_num_trans+1 in ocfs2_start_trans, > the ocfs2_commit_trans doesn't do reduction operation, 'sync' neither. > finally no function reset ->j_num_trans until umount is triggered. > > call flow: > ``` > [md5sum] //line 53 54 > vfs_read > ocfs2_file_read_iter > ocfs2_inode_lock_atime > ocfs2_update_inode_atime > + ocfs2_start_trans //atomic_inc j_num_trans > + ... > + ocfs2_commit_trans//no modify j_num_trans > > sync //line 57. no modify j_num_trans > > _dmerror_load_error_table //all write will return error after this line > > _dmerror_unmount //found j_num_trans is not zero, run commit thread > //but the underlying dev is dm-error, journaling IO > //failed all the time and keep going to retry. > ``` > > *** How to fix *** > > kick commit thread in sync path, which can reset j_num_trans to 0. > > Signed-off-by: Heming Zhao <heming.zhao at suse.com> > --- > fs/ocfs2/super.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c > index 0b0e6a132101..bb3fa21e9b47 100644 > --- a/fs/ocfs2/super.c > +++ b/fs/ocfs2/super.c > @@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait) > jbd2_log_wait_commit(osb->journal->j_journal, > target); > } > + /* kick commit thread to reset journal->j_num_trans */ > + if (atomic_read(&(osb->journal->j_num_trans))) > + wake_up(&osb->checkpoint_event); > return 0; > } >
Possibly Parallel Threads
- [PATCH 2/2] ocfs2: add error handling path when jbd2 enter ABORT status
- [PATCH] ocfs2: fix missing reset j_num_trans for sync
- [PATCH] ocfs2: fix missing reset j_num_trans for sync
- [PATCH 0/3] ocfs2: Switch over to JBD2.
- [PATCH] ocfs2: flush dentry lock drop when sync ocfs2 volume.