Hello,

TL;DR: I want to only do snapshot-aware defrag on inodes in snapshots that haven't changed since the snapshot was taken. Yay or nay (with a reason why for nay)?

=== How snapshot-aware defrag currently works ==

First the defrag code goes through, reads in, and marks dirty a big area wherever it finds a bunch of tiny extents. We go to write this stuff out, and when the write completes we notice the operation was for defrag. If a snapshot was taken after the inode was created, we walk the entire range of existing extents for this inode and record all of the extents in that range. Then we look up all of the references for each of the extents we found. So say we have 100 snapshots and we defragged 100 extents: we end up with 100 * 100 of these data structures just to know what we have to stitch back together.

Then we go through each of these things and do btrfs_drop_extents for the range, and either add the new extent for that range or merge it with the previous one if we've already done that. So it looks like this

[----- New extent -----]
[old1][old2][old3][old4]

We drop old1 and create new1 for the range of old1, so we have this

[new1][old2][old3][old4]

then for the next one we drop old2 and merge it with new1, so we have this

[-- new1 --][old3][old4]

and so on and so forth. We do this because some random extent within this range could have changed, and we don't want to overwrite that bit.

=== Problems that need to be fixed ==

1) The memory usage for this is astronomical. Every extent for every snapshot we are replacing is recorded in memory while we do this, which is why this feature was disabled: people were constantly OOM'ing.

2) Currently this stitching operation is done in btrfs_finish_ordered_io, which means anybody waiting for this ordered extent (or any ordered extent) to complete will block until the stitching finishes, which could be very time consuming.
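To make the drop-and-merge walkthrough above concrete, here is a minimal user-space model of the merge-as-you-go loop. This is not kernel code: the `struct extent` type and `stitch_range()` helper are made up for illustration, and the "drop" step stands in for what btrfs_drop_extents does in the real path.

```c
#include <assert.h>

/* Toy model of one defragged range being stitched back together:
 * each contiguous old extent is dropped and the single growing
 * "new" extent record is either created or extended to cover it. */
struct extent {
	unsigned long start;
	unsigned long len;
};

/* Walk the old extents in order, as in the
 * [new1][old2][old3][old4] -> [-- new1 --][old3][old4] example.
 * Returns the number of drop-and-replace steps performed. */
static int stitch_range(const struct extent *olds, int nr_olds,
			struct extent *out_new)
{
	int i, steps = 0;

	out_new->start = 0;
	out_new->len = 0;
	for (i = 0; i < nr_olds; i++) {
		/* stand-in for btrfs_drop_extents() on olds[i]'s range */
		if (out_new->len == 0) {
			/* first old extent: create new1 over its range */
			out_new->start = olds[i].start;
			out_new->len = olds[i].len;
		} else {
			/* later extents are contiguous: merge into new1 */
			out_new->len += olds[i].len;
		}
		steps++;
	}
	return steps;
}
```

The point of the model is the cost: this merge happens per extent, per reference, which is where the 100 snapshots * 100 extents blow-up in memory comes from.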
=== Solutions ==

1) Move the snapshot-aware stitching off into a different thread. We can make this block unmount until it completes if we want to make sure we always finish the job, or we can let it exit and lose some of the snapshot awareness. Either way is fine with me; I figure we'd go with blocking first and change it if somebody complains too loudly.

2) Fix how we do the stitching. This is where I need input.

What I want to do is look up just the first extent, which gives me all of the roots that share the extent range we are defragging. Then I look up those inodes, do btrfs_drop_extents, and add in the new extent, the same way btrfs_finish_ordered_io works. This makes the stitching operation much simpler, less error prone, and much, much faster. The drawback is that we need extra checks to make sure the inode in each snapshot hasn't changed since we took the snapshot. So if any of that file has been modified, even if none of the data has, we won't do anything, because we won't be able to verify that it is the same.

This isn't the only way we can do it. I can fix the stitching to do one extent at a time across all roots; that way we only allocate N-snapshots worth of entries at a time, which keeps our allocation low. The other option is to do one root at a time and process each extent. Either of these will be fine and will reduce our memory usage, but both will be pretty disk-IO intensive.

=== Summary and what I need ==

Option 1: Only relink inodes that haven't changed since the snapshot was taken.

Pros:
-Faster
-Simpler
-Less duplicated code; uses existing functions for the tricky operations, so less likely to introduce weird bugs.

Cons:
-Could possibly lose some of the snapshot-awareness of the defrag. If you just touch a file, we would not do the relinking and you'd end up with twice the space usage.

Option 2: Process each root one extent at a time, in whatever way results in the least memory usage.
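A sketch of the Option 1 safety check, to show how cheap it could be. One plausible way to implement "hasn't changed since the snapshot was taken" is to compare the inode's last-modified transaction id against the transaction id at which the snapshot root was created; the struct layout and field names below are illustrative, not the actual kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: each inode remembers the transaction in which
 * it was last modified, and each snapshot root remembers the
 * transaction in which the snapshot was created. */
struct snap_inode {
	uint64_t generation;	/* transid of last modification */
};

struct snap_root {
	uint64_t otransid;	/* transid when the snapshot was taken */
};

/* Relink only if the inode has not been touched since the snapshot
 * was created.  Any modification after that point -- even a
 * metadata-only touch with unchanged data -- means we cannot prove
 * the extents are the same, so we skip the inode entirely. */
static bool can_relink(const struct snap_inode *inode,
		       const struct snap_root *snap)
{
	return inode->generation <= snap->otransid;
}
```

This check is why Option 1 is fast (one comparison per inode instead of a per-extent walk) and also why it gives up some snapshot-awareness: a bare touch of the file bumps the inode's generation and disqualifies it.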
Pros:
-Maximizes the space reduction of the snapshot-aware defrag. Every extent that is the same as the original will be replaced with the new extent, and all will be well with the world.

Cons:
-Way slower. We'll have to walk and check every extent to make sure we can actually replace it. This is how it used to work, so we'd be consistent with the 2 or 3 releases where we had snapshot-aware defrag enabled.
-More complicated. We have to do a lot of extra checking and such: new code, and more possibility for bugs to show up.

So tell me which one you want and I'll do that one. If you want Option 2, please explain your use case so I can keep it in mind when deciding how to make the memory usage suck less.

Thanks,

Josef