Joel Becker
2007-Mar-28 18:50 UTC
[Ocfs2-devel] [PATCH 0/3] Properly protect the extent map
The extent map code was seeing an error trip when inserting records. There is a sanity check ensuring that the new record is within the known size of the file - as known by the extent map. The problem was that the "known" size was smaller than expected, returning an error to the caller. This is a lot easier to see in local mounts. It has only been rarely seen in clusters. We have three "independant" problems. 1) We don't take ip_alloc_sem when touching directory allocations. 2) We don't lock em_clusters properly. 3) Local mounts race to ocfs2_extent_map_trunc() in ocfs2_meta_lock_update(). The patches that follow are against mainline. They have been tested against 1.2 with local mounts. Patches (1) and (2) will be ported to the ocfs2 1.2 branch directly, and all three fixes will be pushed to mainline. Joel -- Life's Little Instruction Book #69 "Whistle" jlbec@innerx.net http://ocala.cs.miami.edu/~jlbec
Joel Becker
2007-Mar-28 18:50 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2: Wrap access of directory allocations with ip_alloc_sem.
OCFS2_I(inode)->ip_alloc_sem is a read-write semaphore protecting local concurrent access of ocfs2 inodes. However, ocfs2 directories were not taking the semaphore while they accessed or modified the allocation tree. ocfs2_extend_dir() needs to take the semaphore in a write mode when it adds to the allocation. All other directory users get there via ocfs2_bread(), which takes the semaphore in read mode. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- fs/ocfs2/dir.c | 7 ++++++- fs/ocfs2/inode.c | 2 ++ 2 files changed, 8 insertions(+), 1 deletions(-) diff --git a/fs/ocfs2/dir.c b/fs/ocfs2/dir.c index 66821e1..fd41f8b 100644 --- a/fs/ocfs2/dir.c +++ b/fs/ocfs2/dir.c @@ -402,7 +402,7 @@ static int ocfs2_extend_dir(struct ocfs2 struct buffer_head **new_de_bh) { int status = 0; - int credits, num_free_extents; + int credits, num_free_extents, drop_alloc_sem = 0; loff_t dir_i_size; struct ocfs2_dinode *fe = (struct ocfs2_dinode *) parent_fe_bh->b_data; struct ocfs2_alloc_context *data_ac = NULL; @@ -451,6 +451,9 @@ static int ocfs2_extend_dir(struct ocfs2 credits = OCFS2_SIMPLE_DIR_EXTEND_CREDITS; } + down_write(&OCFS2_I(dir)->ip_alloc_sem); + drop_alloc_sem = 1; + handle = ocfs2_start_trans(osb, credits); if (IS_ERR(handle)) { status = PTR_ERR(handle); @@ -496,6 +499,8 @@ static int ocfs2_extend_dir(struct ocfs2 *new_de_bh = new_bh; get_bh(*new_de_bh); bail: + if (drop_alloc_sem) + up_write(&OCFS2_I(dir)->ip_alloc_sem); if (handle) ocfs2_commit_trans(osb, handle); diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 28ab56f..480d90f 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -1121,8 +1121,10 @@ struct buffer_head *ocfs2_bread(struct i return NULL; } + down_read(&OCFS2_I(inode)->ip_alloc_sem); tmperr = ocfs2_extent_map_get_blocks(inode, block, 1, &p_blkno, NULL); + up_read(&OCFS2_I(inode)->ip_alloc_sem); if (tmperr < 0) { mlog_errno(tmperr); goto fail; -- 1.4.2.3
Joel Becker
2007-Mar-28 18:51 UTC
[Ocfs2-devel] [PATCH 3/3] ocfs2: Local mounts should not truncate the extent map.
A local mount can implicitly trust state on the inode. There are no other nodes to change the filesystem. Thus, local mounts should leave the extent map intact until ->clear_inode(). This fixes a bug where a call to ocfs2_meta_lock_update() can race queries of the extent map. A clustered mount only allows one process through ocfs2_meta_lock_update(), but a local mount allows all processes through. Thus, one process can be all the way to ocfs2_extent_map_get_blocks() when a second process is truncating the extent map in ocfs2_meta_lock_update(). That no longer happens because the extent map is unmodified. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- fs/ocfs2/dlmglue.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/fs/ocfs2/dlmglue.c b/fs/ocfs2/dlmglue.c index e335541..eb1c8fa 100644 --- a/fs/ocfs2/dlmglue.c +++ b/fs/ocfs2/dlmglue.c @@ -1508,8 +1508,9 @@ static int ocfs2_meta_lock_update(struct ocfs2_metadata_cache_purge(inode); /* will do nothing for inode types that don't use the extent - * map (directories, bitmap files, etc) */ - ocfs2_extent_map_trunc(inode, 0); + * map (bitmap files, etc) */ + if (!ocfs2_mount_local(osb)) + ocfs2_extent_map_trunc(inode, 0); if (lockres && ocfs2_meta_lvb_is_trustable(inode, lockres)) { mlog(0, "Trusting LVB on inode %llu\n", -- 1.4.2.3
Joel Becker
2007-Mar-28 18:51 UTC
[Ocfs2-devel] [PATCH 2/3] ocfs2: Properly lock extent map size changes.
The extent map code failed to properly lock changes to ->em_clusters, the extent map's idea of its own size. This leads to a subtle race. One process is updating the size to match an inode that changed, while another process is already past that in the lookup code checking the size against its arguments. For a moment, the size is wrong (due to how the size is checked and calculated). Properly locking the update and the query makes this safe. The check for size change is abstracted into a common function. Signed-off-by: Joel Becker <joel.becker@oracle.com> --- fs/ocfs2/extent_map.c | 90 +++++++++++++++++++++++++++---------------------- 1 files changed, 50 insertions(+), 40 deletions(-) diff --git a/fs/ocfs2/extent_map.c b/fs/ocfs2/extent_map.c index 80ac69f..9ff4351 100644 --- a/fs/ocfs2/extent_map.c +++ b/fs/ocfs2/extent_map.c @@ -84,6 +84,8 @@ static int ocfs2_extent_map_try_insert(s struct ocfs2_extent_rec *rec, int tree_depth, struct ocfs2_em_insert_context *ctxt); +static void ocfs2_extent_map_check_size_change(struct inode *inode, + u32 expected_clusters); /* returns 1 only if the rec contains all the given clusters -- that is that * rec's cpos is <= the cluster cpos and that the rec endpoint (cpos + @@ -558,8 +560,10 @@ static int ocfs2_extent_map_insert(struc int ret; struct ocfs2_em_insert_context ctxt = {0, }; + spin_lock(&OCFS2_I(inode)->ip_lock); if ((le32_to_cpu(rec->e_cpos) + le32_to_cpu(rec->e_clusters)) > OCFS2_I(inode)->ip_map.em_clusters) { + spin_unlock(&OCFS2_I(inode)->ip_lock); ret = -EBADR; mlog_errno(ret); return ret; @@ -569,6 +573,7 @@ static int ocfs2_extent_map_insert(struc if (!rec->e_clusters) { if ((le32_to_cpu(rec->e_cpos) + le32_to_cpu(rec->e_clusters)) ! OCFS2_I(inode)->ip_map.em_clusters) { + spin_unlock(&OCFS2_I(inode)->ip_lock); ret = -EBADR; mlog_errno(ret); ocfs2_error(inode->i_sb, @@ -578,9 +583,12 @@ static int ocfs2_extent_map_insert(struc return ret; } + spin_unlock(&OCFS2_I(inode)->ip_lock); + /* Ignore the truncated tail */ return 0; } + spin_unlock(&OCFS2_I(inode)->ip_lock); ret = -ENOMEM; ctxt.new_ent = kmem_cache_alloc(ocfs2_em_ent_cachep, @@ -662,15 +670,8 @@ int ocfs2_extent_map_append(struct inode BUG_ON(!new_clusters); BUG_ON(le32_to_cpu(rec->e_clusters) < new_clusters); - if (em->em_clusters < OCFS2_I(inode)->ip_clusters) { - /* - * Size changed underneath us on disk. Drop any - * straddling records and update our idea of - * i_clusters - */ - ocfs2_extent_map_drop(inode, em->em_clusters - 1); - em->em_clusters = OCFS2_I(inode)->ip_clusters; - } + ocfs2_extent_map_check_size_change(inode, + OCFS2_I(inode)->ip_clusters); mlog_bug_on_msg((le32_to_cpu(rec->e_cpos) + le32_to_cpu(rec->e_clusters)) !@@ -745,7 +746,6 @@ int ocfs2_extent_map_get_rec(struct inod int *tree_depth) { int ret = -ENOENT; - struct ocfs2_extent_map *em = &OCFS2_I(inode)->ip_map; struct ocfs2_extent_map_entry *ent; *rec = NULL; @@ -753,15 +753,7 @@ int ocfs2_extent_map_get_rec(struct inod if (cpos >= OCFS2_I(inode)->ip_clusters) return -EINVAL; - if (cpos >= em->em_clusters) { - /* - * Size changed underneath us on disk. Drop any - * straddling records and update our idea of - * i_clusters - */ - ocfs2_extent_map_drop(inode, em->em_clusters - 1); - em->em_clusters = OCFS2_I(inode)->ip_clusters ; - } + ocfs2_extent_map_check_size_change(inode, cpos); ent = ocfs2_extent_map_lookup(&OCFS2_I(inode)->ip_map, cpos, 1, NULL, NULL); @@ -782,7 +774,6 @@ int ocfs2_extent_map_get_clusters(struct { int ret; u32 coff, ccount; - struct ocfs2_extent_map *em = &OCFS2_I(inode)->ip_map; struct ocfs2_extent_map_entry *ent = NULL; *p_cpos = ccount = 0; @@ -790,16 +781,7 @@ int ocfs2_extent_map_get_clusters(struct if ((v_cpos + count) > OCFS2_I(inode)->ip_clusters) return -EINVAL; - if ((v_cpos + count) > em->em_clusters) { - /* - * Size changed underneath us on disk. Drop any - * straddling records and update our idea of - * i_clusters - */ - ocfs2_extent_map_drop(inode, em->em_clusters - 1); - em->em_clusters = OCFS2_I(inode)->ip_clusters; - } - + ocfs2_extent_map_check_size_change(inode, v_cpos + count); ret = ocfs2_extent_map_lookup_read(inode, v_cpos, count, &ent); if (ret) @@ -838,7 +820,6 @@ int ocfs2_extent_map_get_blocks(struct i u32 cpos, clusters; int bpc = ocfs2_clusters_to_blocks(inode->i_sb, 1); struct ocfs2_extent_map_entry *ent = NULL; - struct ocfs2_extent_map *em = &OCFS2_I(inode)->ip_map; struct ocfs2_extent_rec *rec; *p_blkno = 0; @@ -852,15 +833,7 @@ int ocfs2_extent_map_get_blocks(struct i return ret; } - if ((cpos + clusters) > em->em_clusters) { - /* - * Size changed underneath us on disk. Drop any - * straddling records and update our idea of - * i_clusters - */ - ocfs2_extent_map_drop(inode, em->em_clusters - 1); - em->em_clusters = OCFS2_I(inode)->ip_clusters; - } + ocfs2_extent_map_check_size_change(inode, cpos + clusters); ret = ocfs2_extent_map_lookup_read(inode, cpos, clusters, &ent); if (ret) { @@ -996,6 +969,43 @@ int ocfs2_extent_map_drop(struct inode * } /* + * This is almost a wrapper of ocfs2_extent_map_drop(), but must + * handle its locking carefully. + */ +static void ocfs2_extent_map_check_size_change(struct inode *inode, + u32 expected_clusters) +{ + struct rb_node *free_head = NULL; + struct ocfs2_extent_map *em = &OCFS2_I(inode)->ip_map; + struct ocfs2_extent_map_entry *ent; + + spin_lock(&OCFS2_I(inode)->ip_lock); + + if (em->em_clusters < expected_clusters) { + /* + * Size changed underneath us on disk. Drop any + * straddling records and update our idea of + * i_clusters + */ + __ocfs2_extent_map_drop(inode, em->em_clusters -1, + &free_head, &ent); + + if (ent) { + rb_erase(&ent->e_node, &em->em_extents); + ent->e_node.rb_right = free_head; + free_head = &ent->e_node; + } + + em->em_clusters = OCFS2_I(inode)->ip_clusters; + } + + spin_unlock(&OCFS2_I(inode)->ip_lock); + + if (free_head) + __ocfs2_extent_map_drop_cleanup(free_head); +} + +/* * Remove all entries past new_clusters and also clip any extent * straddling new_clusters, if there is one. This does not check * or modify ip_clusters -- 1.4.2.3
Fabio Massimo Di Nitto
2007-Mar-29 02:04 UTC
[Ocfs2-devel] [PATCH 0/3] Properly protect the extent map
Joel Becker wrote:> The extent map code was seeing an error trip when inserting records. > There is a sanity check ensuring that the new record is within the known > size of the file - as known by the extent map. The problem was that the > "known" size was smaller than expected, returning an error to the caller. > > This is a lot easier to see in local mounts. It has only been rarely seen > in clusters. > > We have three "independant" problems. > > 1) We don't take ip_alloc_sem when touching directory allocations. > 2) We don't lock em_clusters properly. > 3) Local mounts race to ocfs2_extent_map_trunc() in ocfs2_meta_lock_update(). > > The patches that follow are against mainline. They have been tested > against 1.2 with local mounts. > > Patches (1) and (2) will be ported to the ocfs2 1.2 branch directly, > and all three fixes will be pushed to mainline. > > Joeladd a Signed-off-by: Fabio M. Di Nitto <fabbione@ubuntu.com> on all 3 of them. They fix the problem as i reported it, but it looks like they introduce a performance hit. Somebody other than me should test it too. Fabio -- I'm going to make him an offer he can't refuse.
Mark Fasheh
2007-Apr-02 11:45 UTC
[Ocfs2-devel] [PATCH 0/3] Properly protect the extent map
On Thu, Mar 29, 2007 at 10:38:24AM +0200, Fabio Massimo Di Nitto wrote:> on all 3 of them. They fix the problem as i reported it, but it looks like they > introduce a performance hit.Can you elaborate on that please? I don't see anything in those patches that should cause any performance problems but if you have any more information about what you saw, it might point to something we missed. --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com
Mark Fasheh
2007-Apr-02 11:47 UTC
[Ocfs2-devel] [PATCH 0/3] Properly protect the extent map
On Wed, Mar 28, 2007 at 06:27:07PM -0700, Joel Becker wrote:> The patches that follow are against mainline. They have been tested > against 1.2 with local mounts. > > Patches (1) and (2) will be ported to the ocfs2 1.2 branch directly, > and all three fixes will be pushed to mainline.Ok. These patches look good - thanks. Have you had a chance to test against mainline? --Mark -- Mark Fasheh Senior Software Developer, Oracle mark.fasheh@oracle.com