thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER

If this information is useful, please help other people find it:
Share via:

Tao Ma

2009-Mar-18 13:57 UTC

[Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE

Hi Mark/Joel,
	I meet with some meta allocation bugs when I implement reflink these 
days. And after some investigation, I think we should have the same 
problem when we have PAGE_SIZE > CLUSTER_SIZE. So I create a scenario 
today in one ppc box and try. the box panic as I expected. ;)

The scenario is that: Create a file with the disk layout like this(with
bs=512, and cs=4K).

debugfs: stat 15151
	Inode: 66072   Mode: 0644   Generation: 59969160 (0x3930e88)
<snip>
	Tree Depth: 1   Count: 19   Next Free Rec: 2
	## Offset        Clusters       Block#
	0  0             258            86365
	1  258           66             86367
	SubAlloc Bit: 21   SubAlloc Slot: 0
	Blknum: 86365   Next Leaf: 86367
	CRC32: N/A   ECC: N/A
	Tree Depth: 0   Count: 28   Next Free Rec: 28
	## Offset        Clusters       Block#          Flags
	0  0             1              116696          0x0
<snip>
	25 25            1              117096          0x0
	26 256           1              117112          0x1
	27 257           1              117120          0x0
	SubAlloc Bit: 23   SubAlloc Slot: 0
	Blknum: 86367   Next Leaf: 0
	CRC32: N/A   ECC: N/A
	Tree Depth: 0   Count: 28   Next Free Rec: 2
	## Offset        Clusters       Block#          Flags
	0  258           2              117128          0x1
	1  260           64             117176          0x1

Please note the extent record from "26" to "0" of the next
block are
contiguous allocated with unwritten and then divide it by write to the
256 with 1 cluster.

Now if we try to write 40960 bytes at offset 256. We will panic. Why?
The reason is that:
1. with ppc box, we have page_size=64K. So in one
ocfs2_write_begin_no_lock we will try to handle 40960 bytes together.
2. in ocfs2_lock_allocators we will get that no metadata is need since
the 2nd extent block has so many empty extent recs.
3. then write begin one cluster by one in ocfs2_write_cluster.
   1) The 1st cluster(256) nothing special.
   2) the 2nd (257), it will be merged with 256.
   3) the 3rd (258), be merged with 256.
   4) the 4th (259), be merged. Now 256-259 will be merged into 1 extent
rec, so the 2nd extent block will be removed. and we will get.
	26 256           4              117112          0x0
	27 260           64             117176          0x1
   5) Now comes the 260, we need to split and call ocfs2_add_branch to
allocate a new block. But wait, we have no metadata reserved. So we
panic here.

So my thought is that can we reuse the freed extent block? I guess we 
can. We just need to store the pointer of ocfs2_cached_dealloc_ctxt in 
ocfs2_alloc_context. So whenever we allocate a new metadata, we try to 
search ocfs2_cached_dealloc_ctxt first, if there is some, we use it 
directly and delete it from ocfs2_cached_dealloc_ctxt. The same can go 
for cluster allocation I guess although I don't know whether we have 
such case for clusters.

make sense?

btw, this is critical because we often meet with this type of issue in 
reflink(the 1st step delete a leaf extent block because of merge while 
the 2nd step want to create one because of merge while no metadata are 
reserved). And even worse, I met with a scenario that the process of 
delete/add goes for 6 times.

Regards,
Tao

Tao Ma

2009-Mar-18 22:26 UTC

head link

[Ocfs2-devel] [PATCH] ocfs2: Reused freed extent block in b-tree operation.

Hi all,
        This is the patch for the RFC. I have tested it in my ppc
box, it works OK.

In some b-tree operations we may have the chance that we haven't
reserved any metadata at the beginning because we think we don't
need. While the 1st operation free a extent block while the 2nd
operation need one. Our current code can't handle this. So this
patch try to re-use the freed extent block so that we can pass
the scenario above. For more details about the bug, see

[RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE
http://oss.oracle.com/pipermail/ocfs2-devel/2009-March/004185.html.

Signed-off-by: Tao Ma <tao.ma at oracle.com>
---
 fs/ocfs2/alloc.c    |   49 +++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ocfs2/alloc.h    |    7 +++++++
 fs/ocfs2/dir.c      |    8 ++++----
 fs/ocfs2/suballoc.c |   23 +++++++++++++++++++++++
 fs/ocfs2/suballoc.h |    3 +++
 fs/ocfs2/xattr.c    |    6 +++---
 6 files changed, 87 insertions(+), 9 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index 678a067..89dcb9c 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -915,7 +915,7 @@ static int ocfs2_create_new_meta_bhs(struct ocfs2_super
*osb,
 				     struct buffer_head *bhs[])
 {
 	int count, status, i;
-	u16 suballoc_bit_start;
+	u16 slot, suballoc_bit_start;
 	u32 num_got;
 	u64 first_blkno;
 	struct ocfs2_extent_block *eb;
@@ -928,6 +928,7 @@ static int ocfs2_create_new_meta_bhs(struct ocfs2_super
*osb,
 					      handle,
 					      meta_ac,
 					      wanted - count,
+					      &slot,
 					      &suballoc_bit_start,
 					      &num_got,
 					      &first_blkno);
@@ -958,7 +959,7 @@ static int ocfs2_create_new_meta_bhs(struct ocfs2_super
*osb,
 			strcpy(eb->h_signature, OCFS2_EXTENT_BLOCK_SIGNATURE);
 			eb->h_blkno = cpu_to_le64(first_blkno);
 			eb->h_fs_generation = cpu_to_le32(osb->fs_generation);
-			eb->h_suballoc_slot = cpu_to_le16(osb->slot_num);
+			eb->h_suballoc_slot = cpu_to_le16(slot);
 			eb->h_suballoc_bit = cpu_to_le16(suballoc_bit_start);
 			eb->h_list.l_count  			
cpu_to_le16(ocfs2_extent_recs_per_eb(osb->sb));
@@ -4912,6 +4913,7 @@ static int __ocfs2_mark_extent_written(struct inode
*inode,
 	struct ocfs2_extent_rec *rec = &el->l_recs[split_index];
 	struct ocfs2_merge_ctxt ctxt;
 	struct ocfs2_extent_list *rightmost_el;
+	struct ocfs2_alloc_context local_meta_ac;
 
 	if (!(rec->e_flags & OCFS2_EXT_UNWRITTEN)) {
 		ret = -EIO;
@@ -4964,6 +4966,17 @@ static int __ocfs2_mark_extent_written(struct inode
*inode,
 	     split_index, ctxt.c_contig_type, ctxt.c_has_empty_extent,
 	     ctxt.c_split_covers_rec);
 
+	/*
+	 * init dealloc in meta_ac here so that we can reuse the freed extent
+	 * block in case.
+	 * If the caller doesn't give us a meta_ac, we just fake one and add
+	 * dealloc in it.
+	 */
+	if (!meta_ac) {
+		memset(&local_meta_ac, 0, sizeof(local_meta_ac));
+		meta_ac = &local_meta_ac;
+	}
+	meta_ac->dealloc = dealloc;
 	if (ctxt.c_contig_type == CONTIG_NONE) {
 		if (ctxt.c_split_covers_rec)
 			ret = ocfs2_replace_extent_rec(inode, handle,
@@ -6172,6 +6185,38 @@ int ocfs2_cache_cluster_dealloc(struct
ocfs2_cached_dealloc_ctxt *ctxt,
 	return ret;
 }
 
+int ocfs2_claim_bit_from_dealloc(struct ocfs2_cached_dealloc_ctxt *ctxt,
+				 u16 *slot, u16 *suballoc_bit,
+				 unsigned int *num_bits, u64 *blkno)
+{
+	struct ocfs2_per_slot_free_list *fl = ctxt->c_first_suballocator;
+	struct ocfs2_cached_block_free *tmp;
+
+	if (!fl || !fl->f_first)
+		return -ENOSPC;
+
+	tmp = fl->f_first->free_next;
+
+	*slot = fl->f_slot;
+	*suballoc_bit = fl->f_first->free_bit;
+	*blkno = fl->f_first->free_blk;
+	*num_bits = 1;
+
+	mlog(0, "claim blkno %llu suballoc_bit %u, slot %u\n",
+	     (unsigned long long)*blkno, *suballoc_bit, *slot);
+
+	kfree(fl->f_first);
+	fl->f_first = tmp;
+
+	/* Free the suballocator if there is none. */
+	if (!fl->f_first) {
+		ctxt->c_first_suballocator = fl->f_next_suballocator;
+		kfree(fl);
+	}
+
+	return 0;
+}
+
 static int ocfs2_free_cached_clusters(struct ocfs2_super *osb,
 				      struct ocfs2_cached_block_free *head)
 {
diff --git a/fs/ocfs2/alloc.h b/fs/ocfs2/alloc.h
index 353254b..4e436a4 100644
--- a/fs/ocfs2/alloc.h
+++ b/fs/ocfs2/alloc.h
@@ -195,10 +195,17 @@ static inline void ocfs2_init_dealloc_ctxt(struct
ocfs2_cached_dealloc_ctxt *c)
 }
 int ocfs2_cache_cluster_dealloc(struct ocfs2_cached_dealloc_ctxt *ctxt,
 				u64 blkno, unsigned int bit);
+static inline int ocfs2_dealloc_has_block(struct ocfs2_cached_dealloc_ctxt *c)
+{
+	return c->c_first_suballocator != NULL;
+}
 static inline int ocfs2_dealloc_has_cluster(struct ocfs2_cached_dealloc_ctxt
*c)
 {
 	return c->c_global_allocator != NULL;
 }
+int ocfs2_claim_bit_from_dealloc(struct ocfs2_cached_dealloc_ctxt *ctxt,
+				 u16 *slot, u16 *suballoc_bit,
+				 unsigned int *num_bits, u64 *blkno);
 int ocfs2_run_deallocs(struct ocfs2_super *osb,
 		       struct ocfs2_cached_dealloc_ctxt *ctxt);
 
diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c
index eeac241..71692c8 100644
--- a/fs/ocfs2/dir.c
+++ b/fs/ocfs2/dir.c
@@ -2394,7 +2394,7 @@ static int ocfs2_dx_dir_attach_index(struct ocfs2_super
*osb,
 {
 	int ret;
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *) di_bh->b_data;
-	u16 dr_suballoc_bit;
+	u16 slot, dr_suballoc_bit;
 	u64 dr_blkno;
 	unsigned int num_bits;
 	struct buffer_head *dx_root_bh = NULL;
@@ -2402,8 +2402,8 @@ static int ocfs2_dx_dir_attach_index(struct ocfs2_super
*osb,
 	struct ocfs2_dir_block_trailer *trailer  		ocfs2_trailer_from_bh(dirdata_bh,
dir->i_sb);
 
-	ret = ocfs2_claim_metadata(osb, handle, meta_ac, 1, &dr_suballoc_bit,
-				   &num_bits, &dr_blkno);
+	ret = ocfs2_claim_metadata(osb, handle, meta_ac, 1, &slot,
+				   &dr_suballoc_bit, &num_bits, &dr_blkno);
 	if (ret) {
 		mlog_errno(ret);
 		goto out;
@@ -2430,7 +2430,7 @@ static int ocfs2_dx_dir_attach_index(struct ocfs2_super
*osb,
 	dx_root = (struct ocfs2_dx_root_block *)dx_root_bh->b_data;
 	memset(dx_root, 0, osb->sb->s_blocksize);
 	strcpy(dx_root->dr_signature, OCFS2_DX_ROOT_SIGNATURE);
-	dx_root->dr_suballoc_slot = cpu_to_le16(osb->slot_num);
+	dx_root->dr_suballoc_slot = cpu_to_le16(slot);
 	dx_root->dr_suballoc_bit = cpu_to_le16(dr_suballoc_bit);
 	dx_root->dr_fs_generation = cpu_to_le32(osb->fs_generation);
 	dx_root->dr_blkno = cpu_to_le64(dr_blkno);
diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index eb21dbb..a70991e 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1618,6 +1618,7 @@ int ocfs2_claim_metadata(struct ocfs2_super *osb,
 			 handle_t *handle,
 			 struct ocfs2_alloc_context *ac,
 			 u32 bits_wanted,
+			 u16 *slot,
 			 u16 *suballoc_bit_start,
 			 unsigned int *num_bits,
 			 u64 *blkno_start)
@@ -1626,6 +1627,27 @@ int ocfs2_claim_metadata(struct ocfs2_super *osb,
 	u64 bg_blkno;
 
 	BUG_ON(!ac);
+
+	/*
+	 * If there is no space reserved in ac, check whether we have
+	 * free some before in dealloc. If yes, allocate one form it.
+	 */
+	if (ac->ac_bits_wanted < (ac->ac_bits_given + bits_wanted)) {
+		mlog(0, "alloc context doesn't have enough meta data reserved.
"
+		     "It has %d, we need %u\n",
+		     ac->ac_bits_given - ac->ac_bits_given, bits_wanted);
+		if (ac->dealloc && ocfs2_dealloc_has_block(ac->dealloc)) {
+			status = ocfs2_claim_bit_from_dealloc(ac->dealloc,
+							slot,
+							suballoc_bit_start,
+							num_bits,
+							blkno_start);
+			if (!status)
+				goto bail;
+			/*fail through, so that it will BUG out. */
+		}
+	}
+
 	BUG_ON(ac->ac_bits_wanted < (ac->ac_bits_given + bits_wanted));
 	BUG_ON(ac->ac_which != OCFS2_AC_USE_META);
 
@@ -1643,6 +1665,7 @@ int ocfs2_claim_metadata(struct ocfs2_super *osb,
 	}
 	atomic_inc(&osb->alloc_stats.bg_allocs);
 
+	*slot = ac->ac_alloc_slot;
 	*blkno_start = bg_blkno + (u64) *suballoc_bit_start;
 	ac->ac_bits_given += (*num_bits);
 	status = 0;
diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index 8c9a78a..da87799 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -34,6 +34,7 @@ typedef int (group_search_t)(struct inode *,
 			     u16 *,			/* *bit_off */
 			     u16 *);			/* *bits_found */
 
+struct ocfs2_cached_dealloc_ctxt;
 struct ocfs2_alloc_context {
 	struct inode *ac_inode;    /* which bitmap are we allocating from? */
 	struct buffer_head *ac_bh; /* file entry bh */
@@ -54,6 +55,7 @@ struct ocfs2_alloc_context {
 	u64    ac_last_group;
 	u64    ac_max_block;  /* Highest block number to allocate. 0 is
 				 is the same as ~0 - unlimited */
+	struct ocfs2_cached_dealloc_ctxt *dealloc;
 };
 
 void ocfs2_free_alloc_context(struct ocfs2_alloc_context *ac);
@@ -83,6 +85,7 @@ int ocfs2_claim_metadata(struct ocfs2_super *osb,
 			 handle_t *handle,
 			 struct ocfs2_alloc_context *ac,
 			 u32 bits_wanted,
+			 u16 *slot,
 			 u16 *suballoc_bit_start,
 			 u32 *num_bits,
 			 u64 *blkno_start);
diff --git a/fs/ocfs2/xattr.c b/fs/ocfs2/xattr.c
index 1563101..c458c0c 100644
--- a/fs/ocfs2/xattr.c
+++ b/fs/ocfs2/xattr.c
@@ -2098,7 +2098,7 @@ static int ocfs2_xattr_block_set(struct inode *inode,
 	struct ocfs2_dinode *di =  (struct ocfs2_dinode *)xs->inode_bh->b_data;
 	handle_t *handle = ctxt->handle;
 	struct ocfs2_xattr_block *xblk = NULL;
-	u16 suballoc_bit_start;
+	u16 slot, suballoc_bit_start;
 	u32 num_got;
 	u64 first_blkno;
 	int ret;
@@ -2112,7 +2112,7 @@ static int ocfs2_xattr_block_set(struct inode *inode,
 		}
 
 		ret = ocfs2_claim_metadata(osb, handle, ctxt->meta_ac, 1,
-					   &suballoc_bit_start, &num_got,
+					   &slot, &suballoc_bit_start, &num_got,
 					   &first_blkno);
 		if (ret < 0) {
 			mlog_errno(ret);
@@ -2134,7 +2134,7 @@ static int ocfs2_xattr_block_set(struct inode *inode,
 		xblk = (struct ocfs2_xattr_block *)new_bh->b_data;
 		memset(xblk, 0, inode->i_sb->s_blocksize);
 		strcpy((void *)xblk, OCFS2_XATTR_BLOCK_SIGNATURE);
-		xblk->xb_suballoc_slot = cpu_to_le16(osb->slot_num);
+		xblk->xb_suballoc_slot = cpu_to_le16(slot);
 		xblk->xb_suballoc_bit = cpu_to_le16(suballoc_bit_start);
 		xblk->xb_fs_generation = cpu_to_le32(osb->fs_generation);
 		xblk->xb_blkno = cpu_to_le64(first_blkno);
-- 
1.6.2.rc2.16.gf474c

Mark Fasheh

2009-Mar-20 00:30 UTC

head link

[Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE

On Wed, Mar 18, 2009 at 09:57:48PM +0800, Tao Ma wrote:> Hi Mark/Joel,
> 	I meet with some meta allocation bugs when I implement reflink these 
> days. And after some investigation, I think we should have the same 
> problem when we have PAGE_SIZE > CLUSTER_SIZE. So I create a scenario 
> today in one ppc box and try. the box panic as I expected. ;)
> 
> The scenario is that: Create a file with the disk layout like this(with
> bs=512, and cs=4K).
> 
> debugfs: stat 15151
> 	Inode: 66072   Mode: 0644   Generation: 59969160 (0x3930e88)
> <snip>
> 	Tree Depth: 1   Count: 19   Next Free Rec: 2
> 	## Offset        Clusters       Block#
> 	0  0             258            86365
> 	1  258           66             86367
> 	SubAlloc Bit: 21   SubAlloc Slot: 0
> 	Blknum: 86365   Next Leaf: 86367
> 	CRC32: N/A   ECC: N/A
> 	Tree Depth: 0   Count: 28   Next Free Rec: 28
> 	## Offset        Clusters       Block#          Flags
> 	0  0             1              116696          0x0
> <snip>
> 	25 25            1              117096          0x0
> 	26 256           1              117112          0x1
> 	27 257           1              117120          0x0
> 	SubAlloc Bit: 23   SubAlloc Slot: 0
> 	Blknum: 86367   Next Leaf: 0
> 	CRC32: N/A   ECC: N/A
> 	Tree Depth: 0   Count: 28   Next Free Rec: 2
> 	## Offset        Clusters       Block#          Flags
> 	0  258           2              117128          0x1
> 	1  260           64             117176          0x1
> 
> Please note the extent record from "26" to "0" of the
next block are
> contiguous allocated with unwritten and then divide it by write to the
> 256 with 1 cluster.
> 
> Now if we try to write 40960 bytes at offset 256. We will panic. Why?
> The reason is that:
> 1. with ppc box, we have page_size=64K. So in one
> ocfs2_write_begin_no_lock we will try to handle 40960 bytes together.
> 2. in ocfs2_lock_allocators we will get that no metadata is need since
> the 2nd extent block has so many empty extent recs.
So, the core problem then is that ocfs2_lock_allocators is not doing as good
a job calculating the meta data allocation this write needs.

> 3. then write begin one cluster by one in ocfs2_write_cluster.
>   1) The 1st cluster(256) nothing special.
>   2) the 2nd (257), it will be merged with 256.
>   3) the 3rd (258), be merged with 256.
>   4) the 4th (259), be merged. Now 256-259 will be merged into 1 extent
> rec, so the 2nd extent block will be removed. and we will get.
> 	26 256           4              117112          0x0
> 	27 260           64             117176          0x1
>   5) Now comes the 260, we need to split and call ocfs2_add_branch to
> allocate a new block. But wait, we have no metadata reserved. So we
> panic here.
Right, ok.

> So my thought is that can we reuse the freed extent block? I guess we 
> can. We just need to store the pointer of ocfs2_cached_dealloc_ctxt in 
> ocfs2_alloc_context. So whenever we allocate a new metadata, we try to 
> search ocfs2_cached_dealloc_ctxt first, if there is some, we use it 
> directly and delete it from ocfs2_cached_dealloc_ctxt. The same can go 
> for cluster allocation I guess although I don't know whether we have 
> such case for clusters.
> 
> make sense?
Yes, but I think a better approach is to fix the core problem, instead of
working around it. What we want to do then is make ocfs2_lock_allocators
smart enough to catch this case so that it can reserve the proper amount of
meta data. The good news is that we already do a scan of the writeable area
in ocfs2_populate_write_desc(). It seems to me that scan is a good place
where we could store some information about contiguous extents which will be
merged. The only thing that'd be left for ocfs2_lock_allocators to do is
look at the remaining extent block counts.
	--Mark

--
Mark Fasheh

Maybe Matching Threads

Search for more maybe matching threads

Ocfs2 devel - Mar 2009 - [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE

[Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE

[Ocfs2-devel] [PATCH] ocfs2: Reused freed extent block in b-tree operation.

[Ocfs2-devel] [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE

Maybe Matching Threads