Tao Ma
2009-Jan-15 21:58 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: Inode Allocation Strategy Improvement.v2
Changelog from V1 to V2:
1. Modify some code according to Mark's advice.
2. Attach some test statistics in the commit log of patch 3 and in this e-mail also. See below.

Hi all,

In ocfs2, when we create a fresh file system and create inodes in it, they are contiguous and good for readdir+stat. But if we delete all the inodes and create them again, the new inodes get spread out, and that isn't what we need. The core problem here is that the inode block search looks for the "emptiest" inode group to allocate from. So if an inode alloc file has many equally (or almost equally) empty groups, new inodes will tend to get spread out amongst them, which in turn can put them all over the disk. This is undesirable because directory operations on conceptually "nearby" inodes force a large number of seeks. For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

So this patch set tries to fix the problem.

patch 1: Optimize inode allocation by remembering the last group.
We add ip_last_used_group to core directory inodes, which records the last used allocation group. Another field named ip_last_used_slot is also added in case inode stealing happens. When claiming a new inode, we pass in the directory's inode so that the allocation can use this information.

patch 2: Let the inode group allocations use the global bitmap directly.

patch 3: We add osb_last_alloc_group in ocfs2_super to record the last used allocation group so that we can make inode groups contiguous enough.

I have done some basic tests and the results are cool.

1. Single node test:
The first column is the result without the inode allocation patches, and the second one with the inode allocation patches enabled. You can see we have a great improvement with the second "ls -lR".

echo 'y'|mkfs.ocfs2 -b 4K -C 4K -M local /dev/sda11

mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time tar jxvf /home/taoma/linux-2.6.28.tar.bz2 -C /mnt/ocfs2/ 1>/dev/null

real	0m20.548s	0m20.106s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time ls -lR /mnt/ocfs2/ 1>/dev/null

real	0m13.965s	0m13.766s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time rm /mnt/ocfs2/linux-2.6.28/ -rf

real	0m13.198s	0m13.091s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time tar jxvf /home/taoma/linux-2.6.28.tar.bz2 -C /mnt/ocfs2/ 1>/dev/null

real	0m23.022s	0m21.360s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time ls -lR /mnt/ocfs2/ 1>/dev/null

real	2m45.189s	0m15.019s

Yes, that is it. ;) I didn't expect we could improve this much when I started.

2. Tested with 4 nodes (megabyte switch for both cross-node communication and iscsi), with the same command sequence (using openmpi to run the commands simultaneously). Although we spend a lot of time in cross-node communication, we still see some performance improvement.

the 1st tar:
real	356.22s		357.70s

the 1st ls -lR:
real	187.33s		187.32s

the rm:
real	260.68s		262.42s

the 2nd tar:
real	371.92s		358.47s

the 2nd ls:
real	197.16s		188.36s

Regards,
Tao
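For convenience, the single-node sequence above can be wrapped in a small script. This is only a sketch restating the commands already listed; the device, mount point, and tarball path are the ones from this mail and are assumptions to adjust for your own setup.

#!/bin/bash
# Sketch only: re-run the single-node sequence from this mail with a cold
# cache before every timed step. Device, mount point and tarball path are
# taken from the mail above -- adjust them for your own machine.
DEV=/dev/sda11
MNT=/mnt/ocfs2
TARBALL=/home/taoma/linux-2.6.28.tar.bz2

cold_time() {
        umount "$MNT" 2>/dev/null
        echo 2 > /proc/sys/vm/drop_caches      # drop cached dentries and inodes
        mount -t ocfs2 "$DEV" "$MNT"
        echo "== $*"
        time "$@" 1>/dev/null
}

echo 'y' | mkfs.ocfs2 -b 4K -C 4K -M local "$DEV"
cold_time tar jxvf "$TARBALL" -C "$MNT"         # 1st tar
cold_time ls -lR "$MNT"                         # 1st ls -lR
cold_time rm -rf "$MNT/linux-2.6.28"            # rm
cold_time tar jxvf "$TARBALL" -C "$MNT"         # 2nd tar
cold_time ls -lR "$MNT"                         # 2nd ls -lR
umount "$MNT"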
Tao Ma
2009-Jan-15 22:00 UTC
[Ocfs2-devel] [PATCH 1/3] ocfs2: Optimize inode allocation by remembering last group.
In ocfs2, the inode block search looks for the "emptiest" inode group to allocate from. So if an inode alloc file has many equally (or almost equally) empty groups, new inodes will tend to get spread out amongst them, which in turn can put them all over the disk. This is undesirable because directory operations on conceptually "nearby" inodes force a large number of seeks.

So we add ip_last_used_group to core directory inodes, which records the last used allocation group. Another field named ip_last_used_slot is also added in case inode stealing happens. When claiming a new inode, we pass in the directory's inode so that the allocation can use this information.

For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

Signed-off-by: Tao Ma <tao.ma at oracle.com>
---
 fs/ocfs2/inode.c    |    2 ++
 fs/ocfs2/inode.h    |    4 ++++
 fs/ocfs2/namei.c    |    4 ++--
 fs/ocfs2/suballoc.c |   36 ++++++++++++++++++++++++++++++++++++
 fs/ocfs2/suballoc.h |    2 ++
 5 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 229e707..0435000 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -351,6 +351,8 @@ void ocfs2_populate_inode(struct inode *inode, struct ocfs2_dinode *fe,
 	ocfs2_set_inode_flags(inode);
 
+	OCFS2_I(inode)->ip_last_used_slot = 0;
+	OCFS2_I(inode)->ip_last_used_group = 0;
 	mlog_exit_void();
 }

diff --git a/fs/ocfs2/inode.h b/fs/ocfs2/inode.h
index eb3c302..e1978ac 100644
--- a/fs/ocfs2/inode.h
+++ b/fs/ocfs2/inode.h
@@ -72,6 +72,10 @@ struct ocfs2_inode_info
 	struct inode			vfs_inode;
 	struct jbd2_inode		ip_jinode;
+
+	/* Only valid if the inode is the dir. */
+	u32				ip_last_used_slot;
+	u64				ip_last_used_group;
 };
 
 /*

diff --git a/fs/ocfs2/namei.c b/fs/ocfs2/namei.c
index 084aba8..9372b23 100644
--- a/fs/ocfs2/namei.c
+++ b/fs/ocfs2/namei.c
@@ -469,8 +469,8 @@ static int ocfs2_mknod_locked(struct ocfs2_super *osb,
 	*new_fe_bh = NULL;
 
-	status = ocfs2_claim_new_inode(osb, handle, inode_ac, &suballoc_bit,
-				       &fe_blkno);
+	status = ocfs2_claim_new_inode(osb, handle, dir, parent_fe_bh,
+				       inode_ac, &suballoc_bit, &fe_blkno);
 	if (status < 0) {
 		mlog_errno(status);
 		goto leave;

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index a696286..487f00c 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -1618,8 +1618,41 @@ bail:
 	return status;
 }
 
+static void ocfs2_init_inode_ac_group(struct inode *dir,
+				      struct buffer_head *parent_fe_bh,
+				      struct ocfs2_alloc_context *ac)
+{
+	struct ocfs2_dinode *fe = (struct ocfs2_dinode *)parent_fe_bh->b_data;
+	/*
+	 * Try to allocate inodes from some specific group.
+	 *
+	 * If the parent dir has recorded the last group used in allocation,
+	 * cool, use it. Otherwise if we try to allocate new inode from the
+	 * same slot the parent dir belongs to, use the same chunk.
+	 *
+	 * We are very careful here to avoid the mistake of setting
+	 * ac_last_group to a group descriptor from a different (unlocked) slot.
+	 */
+	if (OCFS2_I(dir)->ip_last_used_group &&
+	    OCFS2_I(dir)->ip_last_used_slot == ac->ac_alloc_slot)
+		ac->ac_last_group = OCFS2_I(dir)->ip_last_used_group;
+	else if (le16_to_cpu(fe->i_suballoc_slot) == ac->ac_alloc_slot)
+		ac->ac_last_group = ocfs2_which_suballoc_group(
+					le64_to_cpu(fe->i_blkno),
+					le16_to_cpu(fe->i_suballoc_bit));
+}
+
+static inline void ocfs2_save_inode_ac_group(struct inode *dir,
+					     struct ocfs2_alloc_context *ac)
+{
+	OCFS2_I(dir)->ip_last_used_group = ac->ac_last_group;
+	OCFS2_I(dir)->ip_last_used_slot = ac->ac_alloc_slot;
+}
+
 int ocfs2_claim_new_inode(struct ocfs2_super *osb,
 			  handle_t *handle,
+			  struct inode *dir,
+			  struct buffer_head *parent_fe_bh,
 			  struct ocfs2_alloc_context *ac,
 			  u16 *suballoc_bit,
 			  u64 *fe_blkno)
@@ -1635,6 +1668,8 @@ int ocfs2_claim_new_inode(struct ocfs2_super *osb,
 	BUG_ON(ac->ac_bits_wanted != 1);
 	BUG_ON(ac->ac_which != OCFS2_AC_USE_INODE);
 
+	ocfs2_init_inode_ac_group(dir, parent_fe_bh, ac);
+
 	status = ocfs2_claim_suballoc_bits(osb,
 					   ac,
 					   handle,
@@ -1653,6 +1688,7 @@ int ocfs2_claim_new_inode(struct ocfs2_super *osb,
 	*fe_blkno = bg_blkno + (u64) (*suballoc_bit);
 	ac->ac_bits_given++;
+	ocfs2_save_inode_ac_group(dir, ac);
 	status = 0;
 bail:
 	mlog_exit(status);

diff --git a/fs/ocfs2/suballoc.h b/fs/ocfs2/suballoc.h
index e3c13c7..ea85a4c 100644
--- a/fs/ocfs2/suballoc.h
+++ b/fs/ocfs2/suballoc.h
@@ -88,6 +88,8 @@ int ocfs2_claim_metadata(struct ocfs2_super *osb,
 			 u64 *blkno_start);
 int ocfs2_claim_new_inode(struct ocfs2_super *osb,
 			  handle_t *handle,
+			  struct inode *dir,
+			  struct buffer_head *parent_fe_bh,
 			  struct ocfs2_alloc_context *ac,
 			  u16 *suballoc_bit,
 			  u64 *fe_blkno);
-- 
1.5.5
Tao Ma
2009-Jan-15 22:00 UTC
[Ocfs2-devel] [PATCH 2/3] ocfs2: Allocate inode groups from global_bitmap.
Inode groups used to be allocated from the local alloc file, but since we want all inodes to be contiguous enough, we now try to allocate them directly from the global_bitmap.

Signed-off-by: Tao Ma <tao.ma at oracle.com>
---
 fs/ocfs2/suballoc.c |   29 +++++++++++++++++++----------
 1 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index 487f00c..b7a065e 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -48,7 +48,8 @@
 #include "buffer_head_io.h"
 
 #define NOT_ALLOC_NEW_GROUP		0
-#define ALLOC_NEW_GROUP			1
+#define ALLOC_NEW_GROUP			0x1
+#define ALLOC_GROUPS_FROM_GLOBAL	0x2
 
 #define OCFS2_MAX_INODES_TO_STEAL	1024
 
@@ -64,7 +65,8 @@ static int ocfs2_block_group_fill(handle_t *handle,
 static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 				   struct inode *alloc_inode,
 				   struct buffer_head *bh,
-				   u64 max_block);
+				   u64 max_block,
+				   int flags);
 
 static int ocfs2_cluster_group_search(struct inode *inode,
 				      struct buffer_head *group_bh,
@@ -116,6 +118,7 @@ static inline void ocfs2_block_to_cluster_group(struct inode *inode,
 						u16 *bg_bit_off);
 static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb,
 					     u32 bits_wanted, u64 max_block,
+					     int flags,
 					     struct ocfs2_alloc_context **ac);
 
 void ocfs2_free_ac_resource(struct ocfs2_alloc_context *ac)
@@ -403,7 +406,8 @@ static inline u16 ocfs2_find_smallest_chain(struct ocfs2_chain_list *cl)
 static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 				   struct inode *alloc_inode,
 				   struct buffer_head *bh,
-				   u64 max_block)
+				   u64 max_block,
+				   int flags)
 {
 	int status, credits;
 	struct ocfs2_dinode *fe = (struct ocfs2_dinode *) bh->b_data;
@@ -423,7 +427,7 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 	cl = &fe->id2.i_chain;
 	status = ocfs2_reserve_clusters_with_limit(osb, le16_to_cpu(cl->cl_cpg),
-						   max_block, &ac);
+						   max_block, flags, &ac);
 	if (status < 0) {
 		if (status != -ENOSPC)
 			mlog_errno(status);
@@ -531,7 +535,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
 				       struct ocfs2_alloc_context *ac,
 				       int type,
 				       u32 slot,
-				       int alloc_new_group)
+				       int flags)
 {
 	int status;
 	u32 bits_wanted = ac->ac_bits_wanted;
@@ -587,7 +591,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
 			goto bail;
 		}
 
-		if (alloc_new_group != ALLOC_NEW_GROUP) {
+		if (!(flags & ALLOC_NEW_GROUP)) {
 			mlog(0, "Alloc File %u Full: wanted=%u, free_bits=%u, "
 			     "and we don't alloc a new group for it.\n",
 			     slot, bits_wanted, free_bits);
@@ -596,7 +600,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
 		}
 
 		status = ocfs2_block_group_alloc(osb, alloc_inode, bh,
-						 ac->ac_max_block);
+						 ac->ac_max_block, flags);
 		if (status < 0) {
 			if (status != -ENOSPC)
 				mlog_errno(status);
@@ -740,7 +744,9 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
 	atomic_set(&osb->s_num_inodes_stolen, 0);
 	status = ocfs2_reserve_suballoc_bits(osb, *ac,
 					     INODE_ALLOC_SYSTEM_INODE,
-					     osb->slot_num, ALLOC_NEW_GROUP);
+					     osb->slot_num,
+					     ALLOC_NEW_GROUP |
+					     ALLOC_GROUPS_FROM_GLOBAL);
 	if (status >= 0) {
 		status = 0;
 
@@ -806,6 +812,7 @@ bail:
  * things a bit.
  */
 static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb,
 					     u32 bits_wanted, u64 max_block,
+					     int flags,
 					     struct ocfs2_alloc_context **ac)
 {
 	int status;
@@ -823,7 +830,8 @@ static int ocfs2_reserve_clusters_with_limit(struct ocfs2_super *osb,
 	(*ac)->ac_max_block = max_block;
 
 	status = -ENOSPC;
-	if (ocfs2_alloc_should_use_local(osb, bits_wanted)) {
+	if (!(flags & ALLOC_GROUPS_FROM_GLOBAL) &&
+	    ocfs2_alloc_should_use_local(osb, bits_wanted)) {
 		status = ocfs2_reserve_local_alloc_bits(osb,
 							bits_wanted,
 							*ac);
@@ -861,7 +869,8 @@ int ocfs2_reserve_clusters(struct ocfs2_super *osb,
 			   u32 bits_wanted,
 			   struct ocfs2_alloc_context **ac)
 {
-	return ocfs2_reserve_clusters_with_limit(osb, bits_wanted, 0, ac);
+	return ocfs2_reserve_clusters_with_limit(osb, bits_wanted, 0,
+						 ALLOC_NEW_GROUP, ac);
 }
 
 /*
-- 
1.5.5
Tao Ma
2009-Jan-15 22:00 UTC
[Ocfs2-devel] [PATCH 3/3] ocfs2: Optimize inode group allocation by recording last used group.
In ocfs2, the block group search looks for the "emptiest" group to allocate from. So if the allocator has many equally (or almost equally) empty groups, new block groups will tend to get spread out amongst them.

So we add osb_inode_alloc_group in ocfs2_super to record the last used inode allocation group.

For more details, please see
http://oss.oracle.com/osswiki/OCFS2/DesignDocs/InodeAllocationStrategy.

I have done some basic tests and the results are cool.

1. Single node test:
The first column is the result without the inode allocation patches, and the second one with the inode allocation patches enabled. You can see we have a great improvement with the second "ls -lR".

echo 'y'|mkfs.ocfs2 -b 4K -C 4K -M local /dev/sda11

mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time tar jxvf /home/taoma/linux-2.6.28.tar.bz2 -C /mnt/ocfs2/ 1>/dev/null

real	0m20.548s	0m20.106s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time ls -lR /mnt/ocfs2/ 1>/dev/null

real	0m13.965s	0m13.766s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time rm /mnt/ocfs2/linux-2.6.28/ -rf

real	0m13.198s	0m13.091s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time tar jxvf /home/taoma/linux-2.6.28.tar.bz2 -C /mnt/ocfs2/ 1>/dev/null

real	0m23.022s	0m21.360s

umount /mnt/ocfs2/
echo 2 > /proc/sys/vm/drop_caches
mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
time ls -lR /mnt/ocfs2/ 1>/dev/null

real	2m45.189s	0m15.019s  (yes, that is it. :) )

2. Tested with 4 nodes (megabyte switch for both cross-node communication and iscsi), with the same command sequence (using openmpi to run the commands simultaneously). Although we spend a lot of time in cross-node communication, we still see some performance improvement.

the 1st tar:
real	356.22s		357.70s

the 1st ls -lR:
real	187.33s		187.32s

the rm:
real	260.68s		262.42s

the 2nd tar:
real	371.92s		358.47s

the 2nd ls:
real	197.16s		188.36s

Signed-off-by: Tao Ma <tao.ma at oracle.com>
---
 fs/ocfs2/ocfs2.h    |    3 +++
 fs/ocfs2/suballoc.c |   32 ++++++++++++++++++++++++++++----
 2 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/ocfs2/ocfs2.h b/fs/ocfs2/ocfs2.h
index ad5c24a..f0377bd 100644
--- a/fs/ocfs2/ocfs2.h
+++ b/fs/ocfs2/ocfs2.h
@@ -335,6 +335,9 @@ struct ocfs2_super
 	struct ocfs2_node_map		osb_recovering_orphan_dirs;
 	unsigned int			*osb_orphan_wipes;
 	wait_queue_head_t		osb_wipe_event;
+
+	/* the group we used to allocate inodes. */
+	u64				osb_inode_alloc_group;
 };
 
 #define OCFS2_SB(sb)	    ((struct ocfs2_super *)(sb)->s_fs_info)

diff --git a/fs/ocfs2/suballoc.c b/fs/ocfs2/suballoc.c
index b7a065e..4c1399c 100644
--- a/fs/ocfs2/suballoc.c
+++ b/fs/ocfs2/suballoc.c
@@ -66,6 +66,7 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 				   struct inode *alloc_inode,
 				   struct buffer_head *bh,
 				   u64 max_block,
+				   u64 *last_alloc_group,
 				   int flags);
 
 static int ocfs2_cluster_group_search(struct inode *inode,
@@ -407,6 +408,7 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 				   struct inode *alloc_inode,
 				   struct buffer_head *bh,
 				   u64 max_block,
+				   u64 *last_alloc_group,
 				   int flags)
 {
 	int status, credits;
@@ -444,6 +446,11 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 		goto bail;
 	}
 
+	if (last_alloc_group && *last_alloc_group != 0) {
+		mlog(0, "use old allocation group %llu for block group alloc\n",
+		     (unsigned long long)*last_alloc_group);
+		ac->ac_last_group = *last_alloc_group;
+	}
 	status = ocfs2_claim_clusters(osb,
 				      handle,
 				      ac,
@@ -518,6 +525,11 @@ static int ocfs2_block_group_alloc(struct ocfs2_super *osb,
 	alloc_inode->i_blocks = ocfs2_inode_sector_count(alloc_inode);
 
 	status = 0;
+
+	/* save the new last alloc group so that the caller can cache it. */
+	if (last_alloc_group)
+		*last_alloc_group = ac->ac_last_group;
+
 bail:
 	if (handle)
 		ocfs2_commit_trans(osb, handle);
@@ -535,6 +547,7 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
 				       struct ocfs2_alloc_context *ac,
 				       int type,
 				       u32 slot,
+				       u64 *last_alloc_group,
 				       int flags)
 {
 	int status;
@@ -600,7 +613,8 @@ static int ocfs2_reserve_suballoc_bits(struct ocfs2_super *osb,
 		}
 
 		status = ocfs2_block_group_alloc(osb, alloc_inode, bh,
-						 ac->ac_max_block, flags);
+						 ac->ac_max_block,
+						 last_alloc_group, flags);
 		if (status < 0) {
 			if (status != -ENOSPC)
 				mlog_errno(status);
@@ -644,7 +658,7 @@ int ocfs2_reserve_new_metadata_blocks(struct ocfs2_super *osb,
 
 	status = ocfs2_reserve_suballoc_bits(osb, (*ac),
 					     EXTENT_ALLOC_SYSTEM_INODE,
-					     slot, ALLOC_NEW_GROUP);
+					     slot, NULL, ALLOC_NEW_GROUP);
 	if (status < 0) {
 		if (status != -ENOSPC)
 			mlog_errno(status);
@@ -690,7 +704,8 @@ static int ocfs2_steal_inode_from_other_nodes(struct ocfs2_super *osb,
 
 		status = ocfs2_reserve_suballoc_bits(osb, ac,
 						     INODE_ALLOC_SYSTEM_INODE,
-						     slot, NOT_ALLOC_NEW_GROUP);
+						     slot, NULL,
+						     NOT_ALLOC_NEW_GROUP);
 		if (status >= 0) {
 			ocfs2_set_inode_steal_slot(osb, slot);
 			break;
@@ -707,6 +722,7 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
 {
 	int status;
 	s16 slot = ocfs2_get_inode_steal_slot(osb);
+	u64 alloc_group;
 
 	*ac = kzalloc(sizeof(struct ocfs2_alloc_context), GFP_KERNEL);
 	if (!(*ac)) {
@@ -742,14 +758,22 @@ int ocfs2_reserve_new_inode(struct ocfs2_super *osb,
 		goto inode_steal;
 
 	atomic_set(&osb->s_num_inodes_stolen, 0);
+	alloc_group = osb->osb_inode_alloc_group;
 	status = ocfs2_reserve_suballoc_bits(osb, *ac,
 					     INODE_ALLOC_SYSTEM_INODE,
 					     osb->slot_num,
+					     &alloc_group,
 					     ALLOC_NEW_GROUP |
 					     ALLOC_GROUPS_FROM_GLOBAL);
 	if (status >= 0) {
 		status = 0;
 
+		spin_lock(&osb->osb_lock);
+		osb->osb_inode_alloc_group = alloc_group;
+		spin_unlock(&osb->osb_lock);
+		mlog(0, "after reservation, new allocation group is "
+		     "%llu\n", (unsigned long long)alloc_group);
+
 		/*
 		 * Some inodes must be freed by us, so try to allocate
 		 * from our own next time.
 		 */
@@ -796,7 +820,7 @@ int ocfs2_reserve_cluster_bitmap_bits(struct ocfs2_super *osb,
 
 	status = ocfs2_reserve_suballoc_bits(osb, ac,
 					     GLOBAL_BITMAP_SYSTEM_INODE,
-					     OCFS2_INVALID_SLOT,
+					     OCFS2_INVALID_SLOT, NULL,
 					     ALLOC_NEW_GROUP);
 	if (status < 0 && status != -ENOSPC) {
 		mlog_errno(status);
-- 
1.5.5
tristan.ye
2009-Jan-16 08:05 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: Inode Allocation Strategy Improvement.v2
On Fri, 2009-01-16 at 05:58 +0800, Tao Ma wrote:
[...]
> umount /mnt/ocfs2/
> echo 2 > /proc/sys/vm/drop_caches
> mount -t ocfs2 /dev/sda11 /mnt/ocfs2/
> time ls -lR /mnt/ocfs2/ 1>/dev/null
>
> real	2m45.189s	0m15.019s
>
> Yes, that is it. ;) I didn't expect we could improve this much when I started.

Tao,

I'm wondering why the 1st 'ls -lR' did not show such a huge enhancement. Was the system load (checked with uptime) similar when you ran the two 'ls -lR' contrast tests? If so, that's a really significant gain!!!! :-) Great congrats!

To get more persuasive testing results, I suggest you repeat the same tests a considerable number of times; averaged statistics would be more convincing :-), and it would also minimize the influence of any exceptional system load. :-)

Tristan
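As a rough sketch of the repeated-run measurement suggested above (the iteration count, device, and mount point here are illustrative assumptions, not part of the original test setup), one could average the cold-cache 'ls -lR' wall time like this:

#!/bin/bash
# Sketch: repeat the cold-cache 'ls -lR' N times and print the average
# wall time, to smooth out exceptional system load. N, the device and
# the mount point are illustrative assumptions; requires bc.
DEV=/dev/sda11
MNT=/mnt/ocfs2
N=5
total=0

for i in $(seq 1 "$N"); do
        umount "$MNT" 2>/dev/null
        echo 2 > /proc/sys/vm/drop_caches       # start each run cold
        mount -t ocfs2 "$DEV" "$MNT"
        start=$(date +%s.%N)
        ls -lR "$MNT" > /dev/null
        end=$(date +%s.%N)
        t=$(echo "$end - $start" | bc)
        echo "run $i: ${t}s"
        total=$(echo "$total + $t" | bc)
done
echo "average over $N runs: $(echo "scale=3; $total / $N" | bc)s"
umount "$MNT"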
tristan.ye
2009-Feb-13 02:42 UTC
[Ocfs2-devel] [PATCH 0/3] ocfs2: Inode Allocation Strategy Improvement.v2
On Fri, 2009-01-16 at 05:58 +0800, Tao Ma wrote:
[...]
> I have done some basic tests and the results are cool.
[...]

Tao, Mark,

I've done a series of stricter tests with a much higher workload to demonstrate the performance gain from Tao's patches.
Following are the testing steps:

1st Tar: untar files into a freshly mkfsed, empty fs enough times to fill the whole disk up (here we use a 100G volume).
1st Ls:  traverse all inodes in the fs recursively.
1st Rm:  remove all inodes in the fs.
2nd Tar: untar files again into the now-empty fs.
2nd Ls:  the same as 1st Ls.
2nd Rm:  the same as 1st Rm.

We used the same testing steps to run a comparison between the patched kernel and the original kernel.

From the above tests, we expected to see a performance gain during the 2nd Ls and 2nd Rm, since we know the patched kernel provides better inode locality when creating during the 2nd Tar, while the original kernel goes round robin through the inode allocator, which makes for poor locality.

And I'd like to say the results of the real tests were awesome and encouraging. Following are the testing reports.

1. Single node test.

========Time Consumed Statistics (2 iterations)========
            [Patched kernel]    [Original kernel]
1st Tar:       1745.17s            1751.86s
1st Ls:        2128.81s            2262.13s
1st Rm:        1760.66s            1857.06s
2nd Tar:       1924.77s            1917.75s
2nd Ls:        2313.11s            8196.51s
2nd Rm:        1925.14s            2372.10s

2. Multiple nodes tests.

1) From node1: test5

========Time Consumed Statistics (2 iterations)========
            [Patched kernel]    [Original kernel]
1st Tar:       3528.36s            3422.23s
1st Ls:        3035.17s            6009.16s
1st Rm:        2436.65s            2307.37s
2nd Tar:       3131.00s            3521.21s
2nd Ls:        2949.31s            4002.07s
2nd Rm:        2425.09s            3365.42s

2) From node2: test12

========Time Consumed Statistics (2 iterations)========
            [Patched kernel]    [Original kernel]
1st Tar:       3470.28s            3876.46s
1st Ls:        2972.58s            6743.32s
1st Rm:        2413.23s            2572.18s
2nd Tar:       3848.56s            3521.21s
2nd Ls:        2887.13s            8259.07s
2nd Rm:        2478.70s            4152.42s

The statistics from the above tests are persuasive; this patch set really behaved well during these performance comparison tests :), and it should be the right time to get the patches committed.

Regards,
Tristan
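For reference, the fill-the-volume procedure described above could be driven by a script roughly like the following. This is only a sketch under assumed paths and a 1 GB free-space threshold, not the exact harness used for the numbers above.

#!/bin/bash
# Sketch of the test procedure described above: untar a kernel tree into
# the fs repeatedly until the volume is nearly full, then time a recursive
# ls and a recursive rm. Paths and the fullness threshold are assumptions.
MNT=/mnt/ocfs2
TARBALL=/home/taoma/linux-2.6.28.tar.bz2

i=0
# Keep untarring until less than 1 GB remains available on the volume.
while [ "$(df -P --block-size=1G "$MNT" | awk 'NR==2 {print $4}')" -gt 1 ]; do
        i=$((i + 1))
        mkdir "$MNT/copy-$i"
        tar jxf "$TARBALL" -C "$MNT/copy-$i" || break   # stop on ENOSPC
done
echo "filled the volume with $i kernel trees"

time ls -lR "$MNT" > /dev/null          # the "Ls" step
time rm -rf "$MNT"/copy-*               # the "Rm" step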