Joseph Qi
2023-May-01 02:07 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
Hi,

What's the journal status in this case? I wonder why the commit thread
is not working, since it should flush the journal and reset j_num_trans
during commit cache.

Thanks,
Joseph

On 4/30/23 11:13 AM, Heming Zhao wrote:
> fstests generic cases 266, 272 and 281 trigger a hang at umount time.
>
> I use generic/266 to describe the root cause.
>
> ```
> 49 _dmerror_unmount
> 50 _dmerror_mount
> 51
> 52 echo "Compare files"
> 53 md5sum $testdir/file1 | _filter_scratch
> 54 md5sum $testdir/file2 | _filter_scratch
> 55
> 56 echo "CoW and unmount"
> 57 sync
> 58 _dmerror_load_error_table
> 59 urk=$($XFS_IO_PROG -f -c "pwrite -S 0x63 -b $bufsize 0 $filesize" \
> 60     -c "fdatasync" $testdir/file2 2>&1)
> 61 echo $urk >> $seqres.full
> 62 echo "$urk" | grep -q "error" || _fail "pwrite did not fail"
> 63
> 64 echo "Clean up the mess"
> 65 _dmerror_unmount
> ```
>
> After lines 49-50 umount and remount the ocfs2 device, the test runs
> md5sum to verify the target files. Line 57 runs 'sync' before line 58
> switches the dm target from dm-linear to dm-error. The test then hangs
> at line 65.
>
> md5sum triggers the jbd2 transaction pair ocfs2_start_trans() /
> ocfs2_commit_trans(). But only ocfs2_start_trans() increments
> ->j_num_trans; ocfs2_commit_trans() does not decrement it, and neither
> does 'sync'. In the end nothing resets ->j_num_trans until umount is
> triggered.
>
> Call flow:
> ```
> [md5sum]                   // lines 53-54
>  vfs_read
>   ocfs2_file_read_iter
>    ocfs2_inode_lock_atime
>     ocfs2_update_inode_atime
>      + ocfs2_start_trans   // atomic_inc j_num_trans
>      + ...
>      + ocfs2_commit_trans  // does not modify j_num_trans
>
> sync                       // line 57; does not modify j_num_trans
>
> _dmerror_load_error_table  // all writes return errors after this line
>
> _dmerror_unmount           // finds j_num_trans nonzero, runs the commit
>                            // thread; but the underlying device is now
>                            // dm-error, so journal I/O fails every time
>                            // and the thread keeps retrying.
> ```
>
> *** How to fix ***
>
> Kick the commit thread in the sync path, which can reset j_num_trans to 0.
>
> Signed-off-by: Heming Zhao <heming.zhao at suse.com>
> ---
>  fs/ocfs2/super.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/fs/ocfs2/super.c b/fs/ocfs2/super.c
> index 0b0e6a132101..bb3fa21e9b47 100644
> --- a/fs/ocfs2/super.c
> +++ b/fs/ocfs2/super.c
> @@ -412,6 +412,9 @@ static int ocfs2_sync_fs(struct super_block *sb, int wait)
>  		jbd2_log_wait_commit(osb->journal->j_journal,
>  				     target);
>  	}
> +	/* kick commit thread to reset journal->j_num_trans */
> +	if (atomic_read(&(osb->journal->j_num_trans)))
> +		wake_up(&osb->checkpoint_event);
>  	return 0;
>  }
Heming Zhao
2023-May-01 02:26 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
On Mon, May 01, 2023 at 10:07:34AM +0800, Joseph Qi wrote:
> Hi,
>
> What's the journal status in this case?

jbd2 is marked with JBD2_ABORT.

> I wonder why commit thread is not working, which should flush journal
> and reset j_num_trans during commit cache.

The reason is in my answer above. Below is the dmesg log for generic/266.

```
[ 595.071807] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: joining the lockspace group...
[ 595.080950] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: group event done 0
[ 595.081091] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover 1
[ 595.081580] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: add member 1
[ 595.081886] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_members 1 nodes
[ 595.082173] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: generation 1 slots 1 1:1
[ 595.082420] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory
[ 595.082696] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory 0 in 0 new
[ 595.082932] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory 0 out 0 messages
[ 595.083188] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover 1 generation 1 done: 0 ms
[ 595.084721] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: join complete
[ 595.095436] kernel: ocfs2: Mounting device (253,16) on (node 1, slot 0) with ordered data mode.
[ 595.787078] systemd[1]: Started /usr/bin/bash -c exit 77.
[ 595.802650] systemd[1]: fstests-check.scope: Deactivated successfully.
[ 596.195734] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: joining the lockspace group...
[ 596.206310] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: group event done 0
[ 596.206350] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover 1
[ 596.209667] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: add member 1
[ 596.211433] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover_members 1 nodes
[ 596.213942] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: generation 1 slots 1 1:1
[ 596.215668] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover_directory
[ 596.217171] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover_directory 0 in 0 new
[ 596.218253] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover_directory 0 out 0 messages
[ 596.219401] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: dlm_recover 1 generation 1 done: 0 ms
[ 596.220628] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: join complete
[ 596.224616] kernel: JBD2: Ignoring recovery information on journal
[ 596.229686] kernel: ocfs2: Mounting device (253,32) on (node 1, slot 0) with ordered data mode.
[ 596.870008] systemd[1]: fstest-scratch.mount: Deactivated successfully.
[ 596.236135] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: leaving the lockspace group...
[ 596.237431] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: group event done 0
[ 596.239442] kernel: dlm: 78CE18109AE44E6AA896722A5E25CFC0: release_lockspace final free
[ 596.877286] ocfs2_hb_ctl[5436]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u 78CE18109AE44E6AA896722A5E25CFC0
[ 596.248299] kernel: ocfs2: Unmounting device (253,32) on (node 1)
[ 596.255888] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: leaving the lockspace group...
[ 596.905718] systemd[1]: fstest-test.mount: Deactivated successfully.
[ 596.280602] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: group event done 0
[ 596.281762] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: release_lockspace final free
[ 596.921672] ocfs2_hb_ctl[5448]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u 838DD74847E34AA1839D6C4AA31132F1
[ 596.299507] kernel: ocfs2: Unmounting device (253,16) on (node 1)
[ 596.481755] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: joining the lockspace group...
[ 596.484722] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover 1
[ 596.484729] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: group event done 0
[ 596.485361] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: add member 1
[ 596.485622] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_members 1 nodes
[ 596.485861] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: generation 1 slots 1 1:1
[ 596.486063] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory
[ 596.486293] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory 0 in 0 new
[ 596.486498] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover_directory 0 out 0 messages
[ 596.486768] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: dlm_recover 1 generation 1 done: 0 ms
[ 596.487037] kernel: dlm: 838DD74847E34AA1839D6C4AA31132F1: join complete
[ 596.491655] kernel: ocfs2: Mounting device (253,16) on (node 1, slot 0) with ordered data mode.
[ 597.139758] root[5475]: run xfstest generic/266
[ 596.505896] unknown: run fstests generic/266 at 2023-05-01 10:19:02
[ 597.186885] systemd[1]: Started /usr/bin/bash -c test -w /proc/self/oom_score_adj && echo 250 > /proc/self/oom_score_adj; exec ./tests/generic/266.
[ 597.566466] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: joining the lockspace group...
[ 597.579759] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: group event done 0
[ 597.579766] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover 1
[ 597.581004] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: add member 1
[ 597.581820] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_members 1 nodes
[ 597.582292] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: generation 1 slots 1 1:1
[ 597.582894] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory
[ 597.583423] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory 0 in 0 new
[ 597.583991] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory 0 out 0 messages
[ 597.584547] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover 1 generation 1 done: 4 ms
[ 597.585079] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: join complete
[ 597.590136] kernel: JBD2: Ignoring recovery information on journal
[ 597.594258] kernel: ocfs2: Mounting device (253,32) on (node 1, slot 0) with ordered data mode.
[ 598.239046] systemd[1]: fstest-scratch.mount: Deactivated successfully.
[ 597.606575] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: leaving the lockspace group...
[ 597.607446] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: group event done 0
[ 598.244507] ocfs2_hb_ctl[5680]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u AB0245F740984C25A9D0907D61DFCF8A
[ 597.607843] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: release_lockspace final free
[ 597.615474] kernel: ocfs2: Unmounting device (253,32) on (node 1)
[ 597.641624] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: joining the lockspace group...
[ 597.659435] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover 1
[ 597.659479] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: group event done 0
[ 597.661482] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: add member 1
[ 597.662340] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_members 1 nodes
[ 597.663211] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: generation 1 slots 1 1:1
[ 597.664150] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory
[ 597.665466] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory 0 in 0 new
[ 597.666370] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover_directory 0 out 0 messages
[ 597.667235] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: dlm_recover 1 generation 1 done: 0 ms
[ 597.668428] kernel: dlm: AB0245F740984C25A9D0907D61DFCF8A: join complete
[ 600.194974] systemd[1]: Started Daily Cleanup of Snapper Snapshots.
[ 600.207227] dbus-daemon[672]: [system] Activating via systemd: service name='org.opensuse.Snapper' unit='snapperd.service' requested by ':1.10' (uid=0 pid=5728 comm="/usr/lib/snapper/systemd-helper --cleanup")
[ 600.209151] systemd[1]: Starting DBus interface for snapper...
[ 600.217270] dbus-daemon[672]: [system] Successfully activated service 'org.opensuse.Snapper'
[ 600.217370] systemd[1]: Started DBus interface for snapper.
[ 600.218969] systemd[1]: snapper-cleanup.service: Deactivated successfully.
[ 599.608010] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: joining the lockspace group...
[ 599.611041] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover 1
[ 599.611127] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: group event done 0
[ 599.611813] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: add member 1
[ 599.612019] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_members 1 nodes
[ 599.612227] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: generation 1 slots 1 1:1
[ 599.612397] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory
[ 599.612559] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory 0 in 0 new
[ 599.612712] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory 0 out 0 messages
[ 599.612880] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover 1 generation 1 done: 0 ms
[ 599.613078] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: join complete
[ 599.615320] kernel: JBD2: Ignoring recovery information on journal
[ 599.619221] kernel: ocfs2: Mounting device (254,0) on (node 1, slot 0) with ordered data mode.
[ 600.392044] systemd[1]: fstest-scratch.mount: Deactivated successfully.
[ 599.780111] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: leaving the lockspace group...
[ 599.781949] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: group event done 0
[ 599.784235] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: release_lockspace final free
[ 600.426136] ocfs2_hb_ctl[5762]: ocfs2_hb_ctl /sbin/ocfs2_hb_ctl -K -u F06626AE7D754F11A0211679C145C8BF
[ 599.808566] kernel: ocfs2: Unmounting device (254,0) on (node 1)
[ 599.827803] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: joining the lockspace group...
[ 599.832848] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover 1
[ 599.832984] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: group event done 0
[ 599.833810] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: add member 1
[ 599.834272] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_members 1 nodes
[ 599.834663] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: generation 1 slots 1 1:1
[ 599.835030] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory
[ 599.835474] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory 0 in 0 new
[ 599.835868] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover_directory 0 out 0 messages
[ 599.836365] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: dlm_recover 1 generation 1 done: 4 ms
[ 599.836829] kernel: dlm: F06626AE7D754F11A0211679C145C8BF: join complete
[ 599.845790] kernel: ocfs2: Mounting device (254,0) on (node 1, slot 0) with ordered data mode.
[ 600.079285] kernel: Buffer I/O error on dev dm-0, logical block 2621424, async page read
[ 600.090845] kernel: (xfs_io,5785,0):ocfs2_inode_lock_update:2354 ERROR: status = -5
[ 600.091387] kernel: (xfs_io,5785,0):ocfs2_inode_lock_full_nested:2502 ERROR: status = -5
[ 600.091924] kernel: (xfs_io,5785,0):ocfs2_statfs:1654 ERROR: status = -5
[ 600.092422] kernel: (xfs_io,5785,0):ocfs2_statfs:1686 ERROR: status = -5
[ 600.094252] kernel: (xfs_io,5785,0):ocfs2_inode_lock_update:2354 ERROR: status = -5
[ 600.095029] kernel: (xfs_io,5785,0):ocfs2_inode_lock_full_nested:2502 ERROR: status = -5
[ 600.095891] kernel: (xfs_io,5785,0):ocfs2_statfs:1654 ERROR: status = -5
[ 600.096698] kernel: (xfs_io,5785,0):ocfs2_statfs:1686 ERROR: status = -5
[ 600.099071] kernel: (xfs_io,5785,0):ocfs2_inode_lock_update:2354 ERROR: status = -5
[ 600.099957] kernel: (xfs_io,5785,0):ocfs2_inode_lock_full_nested:2502 ERROR: status = -5
[ 600.100798] kernel: (xfs_io,5785,0):ocfs2_statfs:1654 ERROR: status = -5
[ 600.101623] kernel: (xfs_io,5785,0):ocfs2_statfs:1686 ERROR: status = -5
[ 600.107094] kernel: (xfs_io,5785,1):ocfs2_get_refcount_tree:358 ERROR: status = -5
[ 600.107659] kernel: (xfs_io,5785,1):ocfs2_lock_refcount_tree:456 ERROR: status = -5
[ 600.108175] kernel: (xfs_io,5785,1):ocfs2_refcount_cow_hunk:3417 ERROR: status = -5
[ 600.108654] kernel: (xfs_io,5785,1):ocfs2_refcount_cow:3478 ERROR: status = -5
[ 600.109128] kernel: (xfs_io,5785,1):ocfs2_prepare_inode_for_write:2326 ERROR: status = -5
[ 600.109613] kernel: (xfs_io,5785,1):ocfs2_file_write_iter:2433 ERROR: status = -5
[ 600.118113] kernel: (umount,5789,3):ocfs2_inode_lock_update:2354 ERROR: status = -5
[ 600.118481] kernel: (umount,5789,3):ocfs2_inode_lock_full_nested:2502 ERROR: status = -5
[ 600.118898] kernel: (umount,5789,3):ocfs2_statfs:1654 ERROR: status = -5
[ 600.119364] kernel: (umount,5789,3):ocfs2_statfs:1686 ERROR: status = -5
[ 600.759255] systemd[1]: fstest-scratch.mount: Deactivated successfully.
[ 600.127135] kernel: Buffer I/O error on dev dm-0, logical block 529, lost sync page write
[ 600.127373] kernel: JBD2: I/O error when updating journal superblock for dm-0-24.
[ 600.127615] kernel: Aborting journal on device dm-0-24.
[ 600.127893] kernel: Buffer I/O error on dev dm-0, logical block 529, lost sync page write
[ 600.128137] kernel: JBD2: I/O error when updating journal superblock for dm-0-24.
[ 600.128362] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 600.128591] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_thread:2346 ERROR: status = -5, journal is already aborted.
[ 601.147704] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 602.171745] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 603.195707] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 604.219709] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 605.243712] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 606.267748] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 607.291715] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 608.315742] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 609.339724] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 610.363774] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 611.387693] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 612.411727] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 613.435719] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 614.459743] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 615.483742] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 615.484503] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_thread:2346 ERROR: status = -5, journal is already aborted.
[ 616.507742] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 617.531751] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 618.555743] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 619.579702] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
[ 620.603758] kernel: (ocfs2cmt-F06626,5771,2):ocfs2_commit_cache:321 ERROR: status = -5
^C
tb-fstest1:/usr/src/linux/fs/ocfs2 #
```

Thanks,
Heming

> On 4/30/23 11:13 AM, Heming Zhao wrote:
> > fstests generic cases 266, 272 and 281 trigger a hang at umount time.
> > [snip]
Heming Zhao
2023-May-01 02:48 UTC
[Ocfs2-devel] [PATCH 1/2] ocfs2: fix missing reset j_num_trans for sync
On Mon, May 01, 2023 at 10:07:34AM +0800, Joseph Qi wrote:
> Hi,
>
> What's the journal status in this case?
> I wonder why commit thread is not working, which should flush journal
> and reset j_num_trans during commit cache.

The root cause for generic/266 is that ocfs2_sync_fs() bypasses the
commit thread and calls the jbd2 APIs directly to commit the journal.
With this design, 'sync' never resets ->j_num_trans. So my patch adds
a kick of the commit thread to the sync path.

Thanks,
Heming

> On 4/30/23 11:13 AM, Heming Zhao wrote:
> > fstests generic cases 266, 272 and 281 trigger a hang at umount time.
> > [snip]