adilger@clusterfs.com
2007-Jan-25  04:05 UTC
[Lustre-devel] [Bug 10827] Online defragmenter
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=10827
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #9265|review?(adilger@clusterfs.co|review+
               Flag|m)                          |
(From update of attachment 9265)
0.1.1
Components of structures always need to have a prefix related to the structure
name.  This makes it easier to disambiguate fields in structures.
For example,
struct ext3_buddy_info {
  unsigned ebi_group;
  unsigned short ebi_counters[];
  void *ebi_bitmap;
}
0.1.2.3
Even in the face of errors, the defragmenter should likely report the current
status before exiting.	It may already have done a lot of work.
0.1.2.5
current_pos and the return value need to be a __u64, since even if it is in
units of filesystem blocks an int would be too small at 16TB (already passed
with ext4).  All code in the tool should assume 64-bit blocks, inode numbers,
file sizes, even if this isn''t yet possible in all cases for ext3.
0.1.2.6
I was just thinking about packing small files, and the current algorithm likely
needs to have a small adjustment to it.  Currently it picks the chunk with the
least space, but there is no idea whether this chunk is 99% full with the end
of a single file, or has 99 small files packed into it.
I think we may need to try and pick chunks that already have several files in
them, otherwise we run into the situation where we are deleting the large files
and this leaves many almost-free chunks behind that have a small file at the
end, instead of the small files being packed into a small number of chunks. 
One possibility is to keep track on a per-chunk basis the number of files that
are referencing blocks in that chunk.
Alex, will there be any policy in mballoc to try and keep small files in a
specific region of the disk, or will it only consider small files when doing
writes on all of them at one time?
0.2.2
Instead of "bit_reset" please use "bit_clear" as that is
more common naming in
the kernel code.
0.2.3
Will the tool also report some kind of status on what work it did defragmenting
the filesystem?  Things like "number of files checked" "number of
fragmented
files found" "number of chunks migrated" etc.
Also, given that we are walking all of the inodes during the run, we could also
keep stats on fragmented files (files not split on chunk-size boundaries),
number of large and small files, fragmentation as a percentage of number of
disjoint chunks / file_size / chunk_size, etc.
0.2.4  The modification time should be a parameter to this function.
0.2.7
We shouldn''t need to call refresh_device_info() for every file moved. 
The
defragger should try to keep its internal state consistent with the changes
that it is making.  Otherwise on large filesystems we will be reading many GB
of group bitmaps from the kernel for every file moved.
Why is the tail of a file ignored?  It might still be migrated if e.g. it
wasn''t filling a chunk or if it was itself fragmented into several
chunks.
0.2.8
For inode_block_map(i) wouldn''t this be equivalent to using the FIBMAP
ioctl
for block i of the file?  The FIBMAP may be more up-to-date, but direct inode
reading via libext2fs may be much more efficient.
0.4
The unit test cases look reasonably good to start.
There should also be some integration tests like running the tool on an
existing filesystem, and verifying that the number of free chunks does not
decrease after the run.