thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally [May 2022]

If this information is useful, please help other people find it:
Share via:

Heming Zhao

2022-May-21 10:14 UTC

[Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path

defragfs triggered jbd2 report ASSERT when working.

code path:

ocfs2_ioctl_move_extents
 ocfs2_move_extents
  __ocfs2_move_extents_range
   ocfs2_defrag_extent
    __ocfs2_move_extent
     + ocfs2_journal_access_di
     + ocfs2_split_extent  //[1]
     + ocfs2_journal_dirty //crash

[1]: ocfs2_split_extent called ocfs2_extend_trans, which committed
dirty buffers then restarted the transaction. The changed transaction
triggered jbd2 ASSERT.

crash stacks:

PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND: "defragfs.ocfs2"
 #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
 #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
 #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
 #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
 #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
 #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
 #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
    [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
    RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
    RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
    RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
    RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
    R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
    R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
 #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
 #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]

The fix method is simple, ocfs2_split_extent includes three paths. all
the paths already have journal access/dirty pair. We only need to
remove journal access/dirty from __ocfs2_move_extent.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/move_extents.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/fs/ocfs2/move_extents.c b/fs/ocfs2/move_extents.c
index 192cad0662d8..6251748c695b 100644
--- a/fs/ocfs2/move_extents.c
+++ b/fs/ocfs2/move_extents.c
@@ -105,14 +105,6 @@ static int __ocfs2_move_extent(handle_t *handle,
 	 */
 	replace_rec.e_flags = ext_flags & ~OCFS2_EXT_REFCOUNTED;
 
-	ret = ocfs2_journal_access_di(handle, INODE_CACHE(inode),
-				      context->et.et_root_bh,
-				      OCFS2_JOURNAL_ACCESS_WRITE);
-	if (ret) {
-		mlog_errno(ret);
-		goto out;
-	}
-
 	ret = ocfs2_split_extent(handle, &context->et, path, index,
 				 &replace_rec, context->meta_ac,
 				 &context->dealloc);
@@ -121,8 +113,6 @@ static int __ocfs2_move_extent(handle_t *handle,
 		goto out;
 	}
 
-	ocfs2_journal_dirty(handle, context->et.et_root_bh);
-
 	context->new_phys_cpos = new_p_cpos;
 
 	/*
-- 
2.34.1

Heming Zhao

2022-May-21 10:14 UTC

head link

[Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally

When la state is ENABLE, ocfs2_recalc_la_window restores la window
unconditionally. The logic is wrong.

Let's image below path.

1. la state (->local_alloc_state) is set THROTTLED or DISABLED.

2. About 30s (OCFS2_LA_ENABLE_INTERVAL), delayed work is triggered,
   ocfs2_la_enable_worker set la state to ENABLED directly.

3. a write IOs thread run:

   ```
   ocfs2_write_begin
    ...
     ocfs2_lock_allocators
      ocfs2_reserve_clusters
       ocfs2_reserve_clusters_with_limit
        ocfs2_reserve_local_alloc_bits
         ocfs2_local_alloc_slide_window // [1]
          + ocfs2_recalc_la_window(osb, OCFS2_LA_EVENT_SLIDE) // [2]
          + ...
          + ocfs2_local_alloc_new_window
             ocfs2_claim_clusters // [3]
   ```

[1]: will be called when la window bits used up.
[2]: under la state is ENABLED (eg OCFS2_LA_ENABLE_INTERVAL delayed work
     happened), it unconditionally restores la window to default value.
[3]: will use default la window size to search clusters. IMO the timing
     is O(n^4). The timing O(n^4) will cost huge time to scan global
     bitmap. It makes write IOs (eg user space 'dd') become dramatically
     slow.

i.e.
an ocfs2 partition size: 1.45TB, cluster size: 4KB,
la window default size: 106MB.
The partition is fragmentation by creating & deleting huge mount of
small file.

the timing should be (the number got from real world):
- la window size change order (size: MB):
  106, 53, 26.5, 13, 6.5, 3.25, 1.6, 0.8
  only 0.8MB succeed, 0.8MB also triggers la window to disable.
  ocfs2_local_alloc_new_window retries 8 times, first 7 times totally
  runs in worst case.
- group chain number: 242
  ocfs2_claim_suballoc_bits calls for-loop 242 times
- each chain has 49 block group
  ocfs2_search_chain calls while-loop 49 times
- each bg has 32256 blocks
  ocfs2_block_group_find_clear_bits calls while-loop for 32256 bits.
  for ocfs2_find_next_zero_bit uses ffz() to find zero bit, let's use
  (32256/64) for timing calucation.

So the loop times: 7*242*49*(32256/64) = 41835024 (~42 million times)

In the worst case, user space writes 100MB data will trigger 42M scanning
times, and if the write can't finish within 30s (OCFS2_LA_ENABLE_INTERVAL),
the write IO will suffer another 42M scanning times. It makes the ocfs2
partition keep pool performance all the time.

The fix method:

1.  la restores double la size once.

current code logic decrease la window with half size once, but directly
restores default_bits one time. It bounces the la window between
'<1M'
and default_bits. This patch makes restoring process more smoothly.
eg.
la default window is 106MB, current la window is 13MB.
when there is a free action to release one block group space, la should
roll back la size to 26MB (by 13*2).
if there are many free actions to release many block group space, la
will smoothly roll back to default window (106MB).

2. introduced a new state: OCFS2_LA_RESTORE.

Current code uses OCFS2_LA_ENABLED to mark a new big space available.
the state overwrite OCFS2_LA_THROTTLED, it makes la window forget
it's already in throttled status.
'->local_alloc_state' should keep OCFS2_LA_THROTTLED until la window
restore to default_bits.

Signed-off-by: Heming Zhao <heming.zhao at suse.com>
---
 fs/ocfs2/localalloc.c | 30 +++++++++++++++++++++---------
 fs/ocfs2/ocfs2.h      | 18 +++++++++++-------
 fs/ocfs2/suballoc.c   |  2 +-
 3 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/fs/ocfs2/localalloc.c b/fs/ocfs2/localalloc.c
index c4426d12a2ad..28acea717d7f 100644
--- a/fs/ocfs2/localalloc.c
+++ b/fs/ocfs2/localalloc.c
@@ -205,20 +205,21 @@ void ocfs2_la_set_sizes(struct ocfs2_super *osb, int
requested_mb)
 
 static inline int ocfs2_la_state_enabled(struct ocfs2_super *osb)
 {
-	return (osb->local_alloc_state == OCFS2_LA_THROTTLED ||
-		osb->local_alloc_state == OCFS2_LA_ENABLED);
+	return osb->local_alloc_state & OCFS2_LA_ACTIVE;
 }
 
 void ocfs2_local_alloc_seen_free_bits(struct ocfs2_super *osb,
 				      unsigned int num_clusters)
 {
 	spin_lock(&osb->osb_lock);
-	if (osb->local_alloc_state == OCFS2_LA_DISABLED ||
-	    osb->local_alloc_state == OCFS2_LA_THROTTLED)
+	if (osb->local_alloc_state & (OCFS2_LA_DISABLED |
+		OCFS2_LA_THROTTLED | OCFS2_LA_RESTORE)) {
 		if (num_clusters >= osb->local_alloc_default_bits) {
 			cancel_delayed_work(&osb->la_enable_wq);
-			osb->local_alloc_state = OCFS2_LA_ENABLED;
+			osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
+			osb->local_alloc_state |= OCFS2_LA_RESTORE;
 		}
+	}
 	spin_unlock(&osb->osb_lock);
 }
 
@@ -228,7 +229,10 @@ void ocfs2_la_enable_worker(struct work_struct *work)
 		container_of(work, struct ocfs2_super,
 			     la_enable_wq.work);
 	spin_lock(&osb->osb_lock);
-	osb->local_alloc_state = OCFS2_LA_ENABLED;
+	if (osb->local_alloc_state & OCFS2_LA_DISABLED) {
+		osb->local_alloc_state &= ~OCFS2_LA_DISABLED;
+		osb->local_alloc_state |= OCFS2_LA_ENABLED;
+	}
 	spin_unlock(&osb->osb_lock);
 }
 
@@ -1067,7 +1071,7 @@ static int ocfs2_recalc_la_window(struct ocfs2_super *osb,
 			 * reason to assume the bitmap situation might
 			 * have changed.
 			 */
-			osb->local_alloc_state = OCFS2_LA_THROTTLED;
+			osb->local_alloc_state |= OCFS2_LA_THROTTLED;
 			osb->local_alloc_bits = bits;
 		} else {
 			osb->local_alloc_state = OCFS2_LA_DISABLED;
@@ -1083,8 +1087,16 @@ static int ocfs2_recalc_la_window(struct ocfs2_super
*osb,
 	 * risk bouncing around the global bitmap during periods of
 	 * low space.
 	 */
-	if (osb->local_alloc_state != OCFS2_LA_THROTTLED)
-		osb->local_alloc_bits = osb->local_alloc_default_bits;
+	if (osb->local_alloc_state & OCFS2_LA_RESTORE) {
+		bits = osb->local_alloc_bits * 2;
+		if (bits > osb->local_alloc_default_bits) {
+			osb->local_alloc_bits = osb->local_alloc_default_bits;
+			osb->local_alloc_state = OCFS2_LA_ENABLED;
+		} else {
+			/* keep RESTORE state & set new bits */
+			osb->local_alloc_bits = bits;
+		}
+	}
 
 out_unlock:
 	state = osb->local_alloc_state;
diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index 337527571461..1764077e3229 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -245,14 +245,18 @@ struct ocfs2_alloc_stats
 
 enum ocfs2_local_alloc_state
 {
-	OCFS2_LA_UNUSED = 0,	/* Local alloc will never be used for
-				 * this mountpoint. */
-	OCFS2_LA_ENABLED,	/* Local alloc is in use. */
-	OCFS2_LA_THROTTLED,	/* Local alloc is in use, but number
-				 * of bits has been reduced. */
-	OCFS2_LA_DISABLED	/* Local alloc has temporarily been
-				 * disabled. */
+	/* Local alloc will never be used for this mountpoint. */
+	OCFS2_LA_UNUSED = 1 << 0,
+	/* Local alloc is in use. */
+	OCFS2_LA_ENABLED = 1 << 1,
+	/* Local alloc is in use, but number of bits has been reduced. */
+	OCFS2_LA_THROTTLED = 1 << 2,
+	/* In throttle state, Local alloc meets contig big space. */
+	OCFS2_LA_RESTORE = 1 << 3,
+	/* Local alloc has temporarily been disabled. */
+	OCFS2_LA_DISABLED = 1 << 4,
 };
+#define OCFS2_LA_ACTIVE (OCFS2_LA_ENABLED | OCFS2_LA_THROTTLED |
OCFS2_LA_RESTORE)
 
 enum ocfs2_mount_options
 {
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 166c8918c825..b0df1ab2d6dd 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1530,7 +1530,7 @@ static int ocfs2_cluster_group_search(struct inode *inode,
 		 * of bits. */
 		if (min_bits <= res->sr_bits)
 			search = 0; /* success */
-		else if (res->sr_bits) {
+		if (res->sr_bits) {
 			/*
 			 * Don't show bits which we'll be returning
 			 * for allocation to the local alloc bitmap.
-- 
2.34.1

Joseph Qi

2022-Jun-02 09:34 UTC

head link

[Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path

Hi,

Sorry for the late response since I was busy on other things...

On 5/21/22 6:14 PM, Heming Zhao wrote:> defragfs triggered jbd2 report ASSERT when working.
> 
> code path:
> 
> ocfs2_ioctl_move_extents
>  ocfs2_move_extents
>   __ocfs2_move_extents_range
>    ocfs2_defrag_extent
>     __ocfs2_move_extent
>      + ocfs2_journal_access_di
>      + ocfs2_split_extent  //[1]
>      + ocfs2_journal_dirty //crash
> 
> [1]: ocfs2_split_extent called ocfs2_extend_trans, which committed
> dirty buffers then restarted the transaction. The changed transaction
> triggered jbd2 ASSERT.
> 
> crash stacks:
> 
> PID: 11297  TASK: ffff974a676dcd00  CPU: 67  COMMAND:
"defragfs.ocfs2"
>  #0 [ffffb25d8dad3900] machine_kexec at ffffffff8386fe01
>  #1 [ffffb25d8dad3958] __crash_kexec at ffffffff8395959d
>  #2 [ffffb25d8dad3a20] crash_kexec at ffffffff8395a45d
>  #3 [ffffb25d8dad3a38] oops_end at ffffffff83836d3f
>  #4 [ffffb25d8dad3a58] do_trap at ffffffff83833205
>  #5 [ffffb25d8dad3aa0] do_invalid_op at ffffffff83833aa6
>  #6 [ffffb25d8dad3ac0] invalid_op at ffffffff84200d18
>     [exception RIP: jbd2_journal_dirty_metadata+0x2ba]
>     RIP: ffffffffc09ca54a  RSP: ffffb25d8dad3b70  RFLAGS: 00010207
>     RAX: 0000000000000000  RBX: ffff9706eedc5248  RCX: 0000000000000000
>     RDX: 0000000000000001  RSI: ffff97337029ea28  RDI: ffff9706eedc5250
>     RBP: ffff9703c3520200   R8: 000000000f46b0b2   R9: 0000000000000000
>     R10: 0000000000000001  R11: 00000001000000fe  R12: ffff97337029ea28
>     R13: 0000000000000000  R14: ffff9703de59bf60  R15: ffff9706eedc5250
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffffb25d8dad3ba8] ocfs2_journal_dirty at ffffffffc137fb95 [ocfs2]
>  #8 [ffffb25d8dad3be8] __ocfs2_move_extent at ffffffffc139a950 [ocfs2]
>  #9 [ffffb25d8dad3c80] ocfs2_defrag_extent at ffffffffc139b2d2 [ocfs2]
> 
> The fix method is simple, ocfs2_split_extent includes three paths. all
> the paths already have journal access/dirty pair. We only need to
> remove journal access/dirty from __ocfs2_move_extent.
> 
I am not sure what you mean by "all three paths have journal access/
dirty pair".

It seems we can't do it in your way, as journal access has different
ocfs2_trigger with different offset, e.g. dinode is different from
extent block.

Or we may reserve enough credits to resolve this issue? 

Thanks,
Joseph

Ocfs2 devel - May 2022 - [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally

[Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path

[Ocfs2-devel] [PATCH 2/2] ocfs2: fix for local alloc window restore unconditionally

[Ocfs2-devel] [PATCH 1/2] ocfs2: fix jbd2 assertion in defragment path