adilger@clusterfs.com
2007-Jan-25 04:05 UTC
[Lustre-devel] [Bug 10827] Online defragmenter
Please don''t reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=10827 What |Removed |Added ---------------------------------------------------------------------------- Attachment #9265|review?(adilger@clusterfs.co|review+ Flag|m) | (From update of attachment 9265) 0.1.1 Components of structures always need to have a prefix related to the structure name. This makes it easier to disambiguate fields in structures. For example, struct ext3_buddy_info { unsigned ebi_group; unsigned short ebi_counters[]; void *ebi_bitmap; } 0.1.2.3 Even in the face of errors, the defragmenter should likely report the current status before exiting. It may already have done a lot of work. 0.1.2.5 current_pos and the return value need to be a __u64, since even if it is in units of filesystem blocks an int would be too small at 16TB (already passed with ext4). All code in the tool should assume 64-bit blocks, inode numbers, file sizes, even if this isn''t yet possible in all cases for ext3. 0.1.2.6 I was just thinking about packing small files, and the current algorithm likely needs to have a small adjustment to it. Currently it picks the chunk with the least space, but there is no idea whether this chunk is 99% full with the end of a single file, or has 99 small files packed into it. I think we may need to try and pick chunks that already have several files in them, otherwise we run into the situation where we are deleting the large files and this leaves many almost-free chunks behind that have a small file at the end, instead of the small files being packed into a small number of chunks. One possibility is to keep track on a per-chunk basis the number of files that are referencing blocks in that chunk. Alex, will there be any policy in mballoc to try and keep small files in a specific region of the disk, or will it only consider small files when doing writes on all of them at one time? 0.2.2 Instead of "bit_reset" please use "bit_clear" as that is more common naming in the kernel code. 0.2.3 Will the tool also report some kind of status on what work it did defragmenting the filesystem? Things like "number of files checked" "number of fragmented files found" "number of chunks migrated" etc. Also, given that we are walking all of the inodes during the run, we could also keep stats on fragmented files (files not split on chunk-size boundaries), number of large and small files, fragmentation as a percentage of number of disjoint chunks / file_size / chunk_size, etc. 0.2.4 The modification time should be a parameter to this function. 0.2.7 We shouldn''t need to call refresh_device_info() for every file moved. The defragger should try to keep its internal state consistent with the changes that it is making. Otherwise on large filesystems we will be reading many GB of group bitmaps from the kernel for every file moved. Why is the tail of a file ignored? It might still be migrated if e.g. it wasn''t filling a chunk or if it was itself fragmented into several chunks. 0.2.8 For inode_block_map(i) wouldn''t this be equivalent to using the FIBMAP ioctl for block i of the file? The FIBMAP may be more up-to-date, but direct inode reading via libext2fs may be much more efficient. 0.4 The unit test cases look reasonably good to start. There should also be some integration tests like running the tool on an existing filesystem, and verifying that the number of free chunks does not decrease after the run.