Tao Ma
2009-Jun-02 06:14 UTC
[Ocfs2-devel] [PATCH] ocfs2: Adjust rightmost path in ocfs2_add_branch.
In ocfs2_add_branch, we use the rightmost rec of the leaf extent block to generate the e_cpos for the new added branch. In the most case, it is OK but if there is a gap between the the root(or branch) 's rightmost rec and the leaf, it will cause kernel panic if we insert some clusters in it. The message is something like: (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression: le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count) (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28, next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1 [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2] [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2] [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2] ... The panic can be easily reproduced by the following small test case (with bs=512, cs=4K, and I remove all the error handling so that it looks clear enough for reading). int main(int argc, char **argv) { int fd, i; char buf[5] = "test"; fd = open(argv[1], O_RDWR|O_CREAT); for (i = 0; i < 30; i++) { lseek(fd, 40960 * i, SEEK_SET); write(fd, buf, 5); } ftruncate(fd, 1146880); lseek(fd, 1126400, SEEK_SET); write(fd, buf, 5); close(fd); return 0; } The reason of the panic is that: the 30 writes and the ftruncate makes the file's extent list looks like: Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 280 86183 SubAlloc Bit: 7 SubAlloc Slot: 0 Blknum: 86183 Next Leaf: 0 CRC32: 00000000 ECC: 0000 Tree Depth: 0 Count: 28 Next Free Rec: 28 ## Offset Clusters Block# Flags 0 0 1 143368 0x0 1 10 1 143376 0x0 ... 26 260 1 143576 0x0 27 270 1 143584 0x0 Now another write at 1126400(275 cluster) whiich will write at the gap between 271 and 280 will trigger ocfs2_add_branch, but the result after the function looks like: Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 280 86183 0 271 0 143592 So the extent record is intersected and make the following operation bug out. This patch just try to remove the gap before we add the new branch, so that the root(branch) rightmost rec will cover the same right position. So in the above case, before adding branch the tree will be changed to Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 271 86183 SubAlloc Bit: 7 SubAlloc Slot: 0 Blknum: 86183 Next Leaf: 0 CRC32: 00000000 ECC: 0000 Tree Depth: 0 Count: 28 Next Free Rec: 28 ## Offset Clusters Block# Flags 0 0 1 143368 0x0 1 10 1 143376 0x0 ... 26 260 1 143576 0x0 27 270 1 143584 0x0 And after branch add, the tree looks like Tree Depth: 1 Count: 19 Next Free Rec: 1 ## Offset Clusters Block# 0 0 271 86183 0 271 0 143592 Signed-off-by: Tao Ma <tao.ma at oracle.com> --- fs/ocfs2/alloc.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++-- 1 files changed, 94 insertions(+), 4 deletions(-) diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c index 678a067..7ce5e2f 100644 --- a/fs/ocfs2/alloc.c +++ b/fs/ocfs2/alloc.c @@ -475,6 +475,8 @@ struct ocfs2_path { #define path_leaf_el(_path) ((_path)->p_node[(_path)->p_tree_depth].el) #define path_num_items(_path) ((_path)->p_tree_depth + 1) +static int ocfs2_find_path(struct inode *inode, struct ocfs2_path *path, + u32 cpos); /* * Reset the actual path elements so that we can re-use the structure * to build another path. Generally, this involves freeing the buffer @@ -1012,6 +1014,75 @@ static inline u32 ocfs2_sum_rightmost_rec(struct ocfs2_extent_list *el) ocfs2_rec_clusters(el, &el->l_recs[i]); } +static int ocfs2_adjust_new_end(handle_t *handle, + struct inode *inode, + struct ocfs2_extent_tree *et, + u32 new_end) +{ + int status, i; + struct ocfs2_path *path = NULL; + struct ocfs2_extent_list *el; + struct ocfs2_extent_rec *rec; + + path = ocfs2_new_path_from_et(et); + if (!path) { + status = -ENOMEM; + return status; + } + + status = ocfs2_find_path(inode, path, new_end); + if (status < 0) { + mlog_errno(status); + goto out; + } + + status = ocfs2_extend_trans(handle, handle->h_buffer_credits + + path_num_items(path)); + if (status < 0) { + mlog_errno(status); + goto out; + } + + status = ocfs2_et_root_journal_access(handle, inode, et, + OCFS2_JOURNAL_ACCESS_WRITE); + if (status < 0) { + mlog_errno(status); + goto out; + } + + /* Change all the branch except the leaf. */ + for (i = 1; i < path_num_items(path) - 1; i++) { + status = ocfs2_journal_access_eb(handle, inode, + path->p_node[i].bh, + OCFS2_JOURNAL_ACCESS_WRITE); + if (status < 0) { + mlog_errno(status); + goto out; + } + } + + el = path_root_el(path); + rec = &el->l_recs[le32_to_cpu(el->l_next_free_rec) - 1]; + rec->e_int_clusters = cpu_to_le32(new_end - + le32_to_cpu(rec->e_cpos)); + + for (i = 1; i < path_num_items(path) - 1; i++) { + el = path->p_node[i].el; + rec = &el->l_recs[le32_to_cpu(el->l_next_free_rec) - 1]; + rec->e_int_clusters = cpu_to_le32(new_end - + le32_to_cpu(rec->e_cpos)); + } + + for (i = 0; i < path_num_items(path) - 1; i++) { + status = ocfs2_journal_dirty(handle, path->p_node[i].bh); + if (status < 0) + mlog_errno(status); + } +out: + ocfs2_free_path(path); + return status; +} + /* * Add an entire tree branch to our inode. eb_bh is the extent block * to start at, if we don't want to start the branch at the dinode @@ -1038,7 +1109,7 @@ static int ocfs2_add_branch(struct ocfs2_super *osb, struct ocfs2_extent_block *eb; struct ocfs2_extent_list *eb_el; struct ocfs2_extent_list *el; - u32 new_cpos; + u32 new_cpos, root_end; mlog_entry_void(); @@ -1055,6 +1126,28 @@ static int ocfs2_add_branch(struct ocfs2_super *osb, new_blocks = le16_to_cpu(el->l_tree_depth); + eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; + new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); + root_end = ocfs2_sum_rightmost_rec(et->et_root_el); + + /* + * If there is a gap before the root end and the real end + * of the righmost leaf block, we need to remove the gap + * between new_cpos and root_end first so that the tree + * is consistent after we add a new branch(it will start + * from new_cpos). + */ + if (root_end > new_cpos) { + mlog(0, "adjust the cluster end from %u to %u\n", + root_end, new_cpos); + status = ocfs2_adjust_new_end(handle, inode, + et, new_cpos); + if (status) { + mlog_errno(status); + goto bail; + } + } + /* allocate the number of new eb blocks we need */ new_eb_bhs = kcalloc(new_blocks, sizeof(struct buffer_head *), GFP_KERNEL); @@ -1071,9 +1164,6 @@ static int ocfs2_add_branch(struct ocfs2_super *osb, goto bail; } - eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; - new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); - /* Note: new_eb_bhs[new_blocks - 1] is the guy which will be * linked with the rest of the tree. * conversly, new_eb_bhs[0] is the new bottommost leaf. -- 1.6.2.rc2.16.gf474c
Mark Fasheh
2009-Jun-12 00:45 UTC
[Ocfs2-devel] [PATCH] ocfs2: Adjust rightmost path in ocfs2_add_branch.
On Tue, Jun 02, 2009 at 02:14:03PM +0800, Tao Ma wrote:> In ocfs2_add_branch, we use the rightmost rec of the leaf extent block > to generate the e_cpos for the new added branch. In the most case, it > is OK but if there is a gap between the the root(or branch) 's rightmost > rec and the leaf, it will cause kernel panic if we insert some clusters > in it. The message is something like: > (7445,1):ocfs2_insert_at_leaf:3775 ERROR: bug expression: > le16_to_cpu(el->l_next_free_rec) >= le16_to_cpu(el->l_count) > (7445,1):ocfs2_insert_at_leaf:3775 ERROR: inode 66053, depth 0, count 28, > next free 28, rec.cpos 270, rec.clusters 1, insert.cpos 275, insert.clusters 1 > [<fa7ad565>] ? ocfs2_do_insert_extent+0xb58/0xda0 [ocfs2] > [<fa7b08f2>] ? ocfs2_insert_extent+0x5bd/0x6ba [ocfs2] > [<fa7b1b8b>] ? ocfs2_add_clusters_in_btree+0x37f/0x564 [ocfs2] > ... > > The panic can be easily reproduced by the following small test case > (with bs=512, cs=4K, and I remove all the error handling so that it looks > clear enough for reading). > > int main(int argc, char **argv) > { > int fd, i; > char buf[5] = "test"; > > fd = open(argv[1], O_RDWR|O_CREAT); > > for (i = 0; i < 30; i++) { > lseek(fd, 40960 * i, SEEK_SET); > write(fd, buf, 5); > } > > ftruncate(fd, 1146880); > > lseek(fd, 1126400, SEEK_SET); > write(fd, buf, 5); > > close(fd); > > return 0; > } > > The reason of the panic is that: > the 30 writes and the ftruncate makes the file's extent list looks like: > > Tree Depth: 1 Count: 19 Next Free Rec: 1 > ## Offset Clusters Block# > 0 0 280 86183 > SubAlloc Bit: 7 SubAlloc Slot: 0 > Blknum: 86183 Next Leaf: 0 > CRC32: 00000000 ECC: 0000 > Tree Depth: 0 Count: 28 Next Free Rec: 28 > ## Offset Clusters Block# Flags > 0 0 1 143368 0x0 > 1 10 1 143376 0x0 > ... > 26 260 1 143576 0x0 > 27 270 1 143584 0x0 > > Now another write at 1126400(275 cluster) whiich will write at the gap > between 271 and 280 will trigger ocfs2_add_branch, but the result after > the function looks like: > Tree Depth: 1 Count: 19 Next Free Rec: 1 > ## Offset Clusters Block# > 0 0 280 86183 > 0 271 0 143592 > So the extent record is intersected and make the following operation bug out.Is it possible for us to add another patch which changes the bug to something that just goes readonly?> This patch just try to remove the gap before we add the new branch, so that > the root(branch) rightmost rec will cover the same right position. So in the > above case, before adding branch the tree will be changed to > Tree Depth: 1 Count: 19 Next Free Rec: 1 > ## Offset Clusters Block# > 0 0 271 86183 > SubAlloc Bit: 7 SubAlloc Slot: 0 > Blknum: 86183 Next Leaf: 0 > CRC32: 00000000 ECC: 0000 > Tree Depth: 0 Count: 28 Next Free Rec: 28 > ## Offset Clusters Block# Flags > 0 0 1 143368 0x0 > 1 10 1 143376 0x0 > ... > 26 260 1 143576 0x0 > 27 270 1 143584 0x0 > And after branch add, the tree looks like > Tree Depth: 1 Count: 19 Next Free Rec: 1 > ## Offset Clusters Block# > 0 0 271 86183 > 0 271 0 143592 > > Signed-off-by: Tao Ma <tao.ma at oracle.com> > --- > fs/ocfs2/alloc.c | 98 +++++++++++++++++++++++++++++++++++++++++++++++++++-- > 1 files changed, 94 insertions(+), 4 deletions(-) > > diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c > index 678a067..7ce5e2f 100644 > --- a/fs/ocfs2/alloc.c > +++ b/fs/ocfs2/alloc.c > @@ -475,6 +475,8 @@ struct ocfs2_path { > #define path_leaf_el(_path) ((_path)->p_node[(_path)->p_tree_depth].el) > #define path_num_items(_path) ((_path)->p_tree_depth + 1) > > +static int ocfs2_find_path(struct inode *inode, struct ocfs2_path *path, > + u32 cpos); > /* > * Reset the actual path elements so that we can re-use the structure > * to build another path. Generally, this involves freeing the buffer > @@ -1012,6 +1014,75 @@ static inline u32 ocfs2_sum_rightmost_rec(struct ocfs2_extent_list *el) > ocfs2_rec_clusters(el, &el->l_recs[i]); > } > > +static int ocfs2_adjust_new_end(handle_t *handle, > + struct inode *inode, > + struct ocfs2_extent_tree *et, > + u32 new_end) > +{ > + int status, i; > + struct ocfs2_path *path = NULL; > + struct ocfs2_extent_list *el; > + struct ocfs2_extent_rec *rec; > + > + path = ocfs2_new_path_from_et(et); > + if (!path) { > + status = -ENOMEM; > + return status; > + } > + > + status = ocfs2_find_path(inode, path, new_end); > + if (status < 0) { > + mlog_errno(status); > + goto out; > + } > + > + status = ocfs2_extend_trans(handle, handle->h_buffer_credits + > + path_num_items(path)); > + if (status < 0) { > + mlog_errno(status); > + goto out; > + } > + > + status = ocfs2_et_root_journal_access(handle, inode, et, > + OCFS2_JOURNAL_ACCESS_WRITE); > + if (status < 0) { > + mlog_errno(status); > + goto out; > + } > + > + /* Change all the branch except the leaf. */ > + for (i = 1; i < path_num_items(path) - 1; i++) { > + status = ocfs2_journal_access_eb(handle, inode, > + path->p_node[i].bh, > + OCFS2_JOURNAL_ACCESS_WRITE); > + if (status < 0) { > + mlog_errno(status); > + goto out; > + } > + } > + > + el = path_root_el(path); > + rec = &el->l_recs[le32_to_cpu(el->l_next_free_rec) - 1]; > + rec->e_int_clusters = cpu_to_le32(new_end - > + le32_to_cpu(rec->e_cpos)); > + > + for (i = 1; i < path_num_items(path) - 1; i++) { > + el = path->p_node[i].el; > + rec = &el->l_recs[le32_to_cpu(el->l_next_free_rec) - 1]; > + rec->e_int_clusters = cpu_to_le32(new_end - > + le32_to_cpu(rec->e_cpos)); > + } > + > + for (i = 0; i < path_num_items(path) - 1; i++) { > + status = ocfs2_journal_dirty(handle, path->p_node[i].bh); > + if (status < 0) > + mlog_errno(status); > + } > +out: > + ocfs2_free_path(path); > + return status; > +} > + > /* > * Add an entire tree branch to our inode. eb_bh is the extent block > * to start at, if we don't want to start the branch at the dinode > @@ -1038,7 +1109,7 @@ static int ocfs2_add_branch(struct ocfs2_super *osb, > struct ocfs2_extent_block *eb; > struct ocfs2_extent_list *eb_el; > struct ocfs2_extent_list *el; > - u32 new_cpos; > + u32 new_cpos, root_end; > > mlog_entry_void(); > > @@ -1055,6 +1126,28 @@ static int ocfs2_add_branch(struct ocfs2_super *osb, > > new_blocks = le16_to_cpu(el->l_tree_depth); > > + eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; > + new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); > + root_end = ocfs2_sum_rightmost_rec(et->et_root_el); > + > + /* > + * If there is a gap before the root end and the real end > + * of the righmost leaf block, we need to remove the gap > + * between new_cpos and root_end first so that the tree > + * is consistent after we add a new branch(it will start > + * from new_cpos). > + */ > + if (root_end > new_cpos) { > + mlog(0, "adjust the cluster end from %u to %u\n", > + root_end, new_cpos); > + status = ocfs2_adjust_new_end(handle, inode, > + et, new_cpos); > + if (status) { > + mlog_errno(status); > + goto bail; > + } > + } > +Your solution looks technically correct. What I wanted to do is something like changing this section in ocfs2_add_branch(): eb = (struct ocfs2_extent_block *)(*last_eb_bh)->b_data; new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); to something like: eb = (struct ocfs2_extent_block *)(*eb_bh)->b_data; new_cpos = ocfs2_sum_rightmost_rec(&eb->h_list); That would fix up the starting cpos of our new branch. But we'll still wind up with the gap in a non-dege path, which I believe is "illegal" for Ocfs2. So, this is a better solution I think. Can we possibly make use of ocfs2_adjust_rightmost_records() - it looks like it'd handle the bottommost part of ocfs2_adjust_new_end() with extra checking... --Mark -- Mark Fasheh
Apparently Analagous Threads
- [PATCH 0/2] ocfs2: Adjust rightmost path in ocfs2_add_branch.v2
- [PATCH 0/40] ocfs2: Detach ocfs2 metadata I/O from struct inode
- [RFC] metadata alloc fix in machines which has PAGE_SIZE > CLUSTER_SIZE
- [PATCH 0/2] Unwritten extent merge update, V2
- [PATCH 0/2] Two b-tree bug fixes.