thr3ads.net - Ocfs2 devel - [Ocfs2-devel] [patch 1/1] offline de-fragmentation tool [Feb 2008]

If this information is useful, please help other people find it:
Share via:

wengang wang

2008-Feb-03 19:04 UTC

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

I wrote a user space tool to de-fragmentate global bitmap, hope this
tool is helpful.

for the case of storing non-DB data on ocfs2, there are not several
very large file, but lots of relative small files.
after a long time of using, --especially creating and deleting, the
global bitmap is split into fragments. so that even there is enough free
space(but not contiguous), creating a file may fail.

there is a relative bug 6730723(on bugdb) though closed with "not
supported". and I have made a scenario that "df" show the
partition
usage 51%, but on other nodes creating a file fails with "no space"
error.

this offline tool, o2defrag, can make larger contiguous free bits on
global bitmap by moving data clusters of regular file and directories.
it does:
1) for each group, move data clusters on the group to the front of the
same group to make bigger free space at the end.
2) for groups that has more free space, move data clusters to other
group(s) to make much more free space.
a) firstly, it try to move data clusters to the group on which
there are data clusters of the same file. if no such group or no space
on these groups, goto b).
b) move data clusters to a group on which there is no data
clusters of the same file.
3) does step 1) again.

stuff changed by this tool is
I) moving of data clusters.
II) moving of corresponding bits in global bitmap.
III) extent record in ocfs2_dinode block or extension block

this tool doesn't merge or split extent records for now. and it's nearly
help nothing to fs performance.
the feature to be added is trying to move all data clusters of a file
together as possible. so that accessing to file on ocfs2 can get better
performance.

the patch is based on ocfs2-tool 1.2.6. for compiling, needs to add a
symbolic link named "include" in o2defrag directory of
../debugfs.ocfs2/include.
usage is "o2defrag <ocfs2 partition>"

thanks,
wengang.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: o2defrag.patch
Type: text/x-patch
Size: 54374 bytes
Desc: not available
Url :
http://oss.oracle.com/pipermail/ocfs2-devel/attachments/20080204/48954ff1/o2defrag-0001.bin

Tao Ma

2008-Feb-04 00:56 UTC

head link

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

Sunil,
    What do you think of this tools? It is good I think.

Thanks for your hard work, wengang.
It should really take you much time since the patch contains more than 
2000 lines. ;)
Have you written any design doc about it? Do you have any test script 
that can show us how this fantastic tool work?

But next time please use the git tree to get the latest ocfs2-tools 
source code and generate your patch against it, so that I can merge your 
code quickly in my tree. This time, since Joel has moved the include 
file to other places, I can't merge your patch successfully. So I only 
read your patch and give some comments.

wengang wang wrote:> I wrote a user space tool to de-fragmentate global bitmap, hope this 
> tool is helpful.Next time please send it to ocfs2-tools-devel since your patch only 
contains the modification to ocfs2-tools.>
> for the case of storing non-DB data on ocfs2,  there are not several 
> very large file, but lots of relative small files.
> after a long time of using, --especially creating and deleting, the 
> global bitmap is split into fragments. so that even there is enough 
> free space(but not contiguous), creating a file may fail.
>
> there is a relative bug 6730723(on bugdb) though closed with "not 
> supported". and I have made a scenario that "df" show the
partition
> usage 51%, but on other nodes creating a file fails with "no
space"
> error.
>
> this offline tool, o2defrag, can make larger contiguous free bits on 
> global bitmap by moving data clusters of regular file and directories.
> it does:
> 1)  for each group, move data clusters on the group to the front of 
> the same group to make bigger free space at the end.
> 2)  for groups that has more free space, move data clusters to other 
> group(s) to make much more free space.
>     a)   firstly, it try to move data clusters to the group on which 
> there are data clusters of the same file. if no such group or no space 
> on these groups, goto b).
>     b)   move data clusters to a group on which there is no data 
> clusters of the same file.
> 3) does step 1) again.
>
> stuff changed by this tool is
> I)   moving of data clusters.
> II)  moving of corresponding bits in global bitmap.
> III) extent record in ocfs2_dinode block or extension block
>
> this tool doesn't merge or split extent records for now. and it's 
> nearly help nothing to fs performance.
> the feature to be added is trying to move all data clusters of a file 
> together as possible. so that accessing to file on ocfs2 can get 
> better performance.
>
> the patch is based on ocfs2-tool 1.2.6.  for compiling, needs to add a 
> symbolic link named "include" in o2defrag directory of 
> ../debugfs.ocfs2/include.
> usage is  "o2defrag <ocfs2 partition>"
>
>
> thanks,
> wengang.
>
> ------------------------------------------------------------------------
>
> diff -N -u -p -r ocfs2-tools-1.2.6.orig/libocfs2/chain.c
ocfs2-tools-1.2.6/libocfs2/chain.c
> --- ocfs2-tools-1.2.6.orig/libocfs2/chain.c	2008-02-05 01:30:21.000000000
-0500
> +++ ocfs2-tools-1.2.6/libocfs2/chain.c	2008-02-05 02:00:53.000000000 -0500
> @@ -43,7 +43,33 @@ void ocfs2_swap_group_desc(struct ocfs2_
>  	gd->bg_parent_dinode = bswap_64(gd->bg_parent_dinode);
>  	gd->bg_blkno = bswap_64(gd->bg_blkno);
>  }
> +errcode_t ocfs2_read_group_desc2(ocfs2_filesys *fs, uint64_t blkno,
> +					char *gd_buf)
>   You added so many *2 functions. And the biggest difference is that you 
don't allocate the spaces in it and read-write using the buffer given by 
the caller. I am not sure whether it is OK. Even if it is OK, you should 
put it into another patch, since it has no relationship with your 
o2defrag patch. And it also makes your patch smaller so that people can 
review it quickly.
> +struct o2_group_list{
> +	int group_size;		/* count of total bits of a group */
> +	int count;		/* count of groups */
> +	struct o2_group *head;	/* the first group in list */
> +	struct o2_group *last;  /* the last group in list */
> +};
> +
> +struct o2_file_list all_file_list;
> +struct o2_group_list all_group_list;
> +
> +/*
> +struct current_name{
> +	int current_layer;
> +	char *name[MAX_FILE_DEP+1];
> +};
> +
> +struct current_name current_name;
> +*/
>   Could you please remove these unused code before you generate the patch? 
thanks.> +
> +struct o2_modify_meta{
> +	uint64_t meta_blkno;		/* inode number of a fle */
> +	uint64_t old_data_blkno;	
> +	uint64_t new_data_blkno;
> +	int needs_free;			/* this is only for internal use */
> +	struct o2_modify_meta *next;
> +};
> +
> +static void * local_malloc(size_t size, int *needs_free)
> +{
> +	char *ret;
> +	ssize_t cache_size;
> +	int off;
> +
> +	cache_size  = sizeof(struct o2_modify_meta) * COUNT_OF_MODIFY_META;
> +
> +	if (memory_object.current_meta == COUNT_OF_MODIFY_META) {
> +		memory_object.buf = malloc(cache_size);
> +		if (!memory_object.buf) {
> +			printf("no mem\n");
>   
Use com_err like other ocfs2-tools please.> +			return NULL;
> +		}
> +		memory_object.current_meta = 0;
> +	}
> +	
> +	if (memory_object.current_meta == 0) {
> +		*needs_free = 1;
> +	} else {
> +		*needs_free = 0;
> +	}
> +	off = sizeof(struct o2_modify_meta) * memory_object.current_meta;
> +	ret =  memory_object.buf + off;
> +	memory_object.current_meta ++;
> +
> +	return ret;
> +}
> +
> +static int worktree(char *basename, uint64_t blkno)
>   
> +	int res;
> +	char *buf = NULL;
> +	struct ocfs2_dinode *inode = NULL;
> +	struct o2_file *file = NULL;
> +	char type;
> +
> +	res = push_name(basename);
> +	if (res)
> +		return -1;
> +	res = ocfs2_malloc_block(gbls.fs->fs_io, &buf);
> +	if (res) {
> +		printf("ocfs2_malloc_block failed. %d\n", res);
> +		return -1;
> +	}
> +	
> +	res = ocfs2_read_inode2(gbls.fs, blkno, buf);
> +	if (res) {
> +		printf("ocfs2_read_inode error %d\n",res);
> +		goto ret1;
> +	}
> +	inode = (struct ocfs2_dinode *)buf;
> +	if (S_ISREG(inode->i_mode))
> +		type='f';
> +	else if (S_ISDIR(inode->i_mode))
> +		type='d';
> +	else {
> +		// ignore other type
> +		goto ret2;
> +	}
> +
> +	file = new_file(inode, type);
> +	if (!file) {
> +		goto ret1;
> +	}
> +	insert_file_to_list(file);
>   You don't handle the file which has link_count > 2. So here you may 
insert it twice.> +	res = access_one_node(inode, file);
> +	if (res)
> +		goto ret2;
> +
> +	if (file->type == 'd') {
> +		res = ocfs2_dir_iterate(gbls.fs, blkno, 0, NULL,
> +				do_with_childrens, NULL);
> +		if (res) {
> +			printf("ocfs2_dir_iterate failed. %d\n", res);
> +			goto ret2;
> +		}
> +	}
> +	goto ret1;
> +
> +ret2:	
> +	if (file && file->filename)
> +		free(file->filename);
> +	if (file)
> +		free(file);
> +ret1:
> +	if (buf)
> +		ocfs2_free(&buf);
> +	if (res) {
> +		pop_name();
> +	} else {
> +		res = pop_name();
> +	}
> +
> +	return res;
> +}
> +
> +/* find the record --ocfs2_extent_rec, which blkno is old_data_blkno,
change blkno to
> +new_data_blkno
> +*/
> +static inline int modify_file_meta(uint64_t file_blkno, uint64_t
from_blkno,
> +					uint64_t to_blkno)
>   Why go through the file once again? Since you have already iterate all 
the files, you may record the information there. So that you don't need 
to iterate it now the second time.> +{
> +	struct ocfs2_dinode *inode;
> +	struct ocfs2_extent_list *el;
> +	int res;
> +
> +	if (!update_disk)
> +		return 0;
> +
> +	res = ocfs2_read_inode2(gbls.fs, file_blkno, buf_modify_file_meta);
> +	if (res) {
> +		printf("ocfs2_read_inode2 %lu, failed. %d\n",file_blkno, res);
> +		return -1;
> +	}
> +	inode = (struct ocfs2_dinode *)buf_modify_file_meta;
> +	el= &(inode->id2.i_list);
> +
> +	res = find_and_change_extent_rec(el, buf_modify_file_meta, file_blkno,
> +						1, from_blkno, to_blkno);
> +	if (res == 1) {
> +		printf("such meta block %lu not found\n", from_blkno);
> +	}
> +
> +	return res;
> +}
> +
> +/* move the clusters specified by file_group to front of the same group */
> +/* -1 -->error; 0 -->moved successfully; 1 -->no space */
> +static int move_file_group_on_group(struct o2_group *group,
> +				struct o2_file_group *file_group)
> +{
> +	struct ocfs2_group_desc *gd;
> +	int alloc_start_bits_off;
> +	int res;
> +	int first_free_bit;
> +	uint64_t fromblk, toblk;
> +	
> +	gd = group->gd;
> +	
> +	first_free_bit = get_first_free_bit(gd);
> +	if (-1 == first_free_bit) {	//group full
> +		res = 1;
> +		goto done;
> +	}
> +	
> +	if (file_group->off < first_free_bit) {
> +		res = 1;
> +		goto done;
> +	}
> +	
> +	/* clear in memory the bits owned by this file_group */
> +	clear_bits_on_group(file_group->off, file_group->count, gd);
> +
> +	/* try to find enough bits to move to */
> +	res = get_N_contig_free_bits(gd, file_group->count,
> +					&alloc_start_bits_off);
> +	if (res) {
> +		//no space
> +		set_bits_on_group(file_group->off, file_group->count, gd);
> +		res = 1;
> +		goto done;
> +	}
> +	
> +	if (alloc_start_bits_off > file_group->off)
>   Here you need to handle alloc_start_bits_off ==
file_group->off.> +	{
> +		// got a starting address later than the original
> +		clear_bits_on_group(alloc_start_bits_off, file_group->count, gd);
> +		set_bits_on_group(file_group->off, file_group->count, gd);
> +		res = 1;
> +		goto done;
> +	}
> +/* now copy data clusters from old place to new place one by one */
> +	fromblk = file_group->off<<get_shift_bits();
> +	toblk = alloc_start_bits_off<<get_shift_bits();
> +	if (!is_first_group_gd(group)) {
> +		fromblk += group->blkno;
> +		toblk += group->blkno;
> +	} 
> +
> +	res = copy_N_clusters(fromblk, toblk, file_group->count);
> +	if (res) {
> +		//error, won't continue to access.
> +		goto done;
> +	}
> +
> +	/* modify the meta data of the file on disk */
> +	if (!new_and_insert_meta(file_group->file->blkno, fromblk, toblk))
{
> +		//error, won't continue to access.
> +		res = -1;
> +		goto done;
> +	}
> +
> +	/* update the file group object */
> +	file_group->off = alloc_start_bits_off;
> +	file_group->blkno = file_group->off<<get_shift_bits();
> +	if (!is_first_group_gd(file_group->group)) {
> +		file_group->blkno += file_group->group->blkno;
> +	}
> +
> +done:
> +	return res;
> +}
> +
> +
> +/* move all data clusters of files in a group to the end of this group */
>   
You move to the beginning, not the end, it seems. ;)> +/* for all groups, 
> +   for all files that have data-clusters on one of the groups,
> +   move the the data-clusters to front of this group, and move
corresponding
> +   bits to the front of the bitmap in this group.
> +
> +   this moving is based on a ocfs2_extent_rec, no mering or spliting on
ocfs2_extent_rec.
> +
> +   what's modified on disk:
> +   1) data-clusters copying, (while debugging, set the old clusters to 0)
> +   2) file meta-data. --modifying the ocfs2_extent_rec.e_blkno.
> +   3) clearing bits and setting bits on bitmap of that group.
> +*/
> +static int defrag1_move_data_to_front_on_all_group()
> +{
> +	int res;
> +	struct o2_group *group;
> +	struct o2_file_group *tmp;
> +	int first_free_bit;
> +
> +	group = all_group_list.head;
> +	printf("process in defrag1 ...\n");
> +	
> +	while ( group) {
> +		//move group 2 only for debug 
> +		printf("defrag1 working on group %d/%d...",
group->group_num, all_group_list.count);
> +
> +		first_free_bit = get_first_free_bit(group->gd);
> +		if (-1 == first_free_bit) {
> +			//group full
> +			group = group->next;
> +			printf("done.\n");
> +			continue;
> +		}
> +		
> +		tmp = group->file_group_list.head;
> +		while (tmp) {
> +			res = move_file_group_on_group(group, tmp);
> +			if (res == -1) {
> +				printf("move_file_group_on_group failed. %d\n", res);
> +				return res;
> +			}
> +			tmp = tmp->next_in_group;
> +		}
> +		printf("done.\n");
> +		group = group->next;
> +	}
> +
> +	res = update_all_meta();
>   You can't do it here. You really should do it immediately after your 
move the cluster in move_file_group_on_group. Otherwise if the system 
panic before this function, all the files' content is corrupted.

There are many trailing white spaces in your patch. So please remove 
them all.

anyway, thanks again for your work.

Mark Fasheh

2008-Feb-05 12:13 UTC

head link

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

On Mon, Feb 04, 2008 at 10:50:49AM +0800, wengang wang
wrote:> this tool doesn't merge or split extent records for now. and it's
nearly
> help nothing to fs performance.
> the feature to be added is trying to move all data clusters of a file 
> together as possible. so that accessing to file on ocfs2 can get better 
> performance.
Merging extents is pretty high value, so I'd definitely like to see that as
your next step ;)
	--Mark

--
Mark Fasheh
Principal Software Developer, Oracle
mark.fasheh@oracle.com

Ocfs2 devel - Feb 2008 - [patch 1/1] offline de-fragmentation tool

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool

[Ocfs2-devel] [patch 1/1] offline de-fragmentation tool