We have a big problem, but it involves a lot of moving parts, so I'm going to explain all of the parts, and then the problem, and then what I am doing to fix the problem. I want you guys to check my work to make sure I'm not missing something so when I come back from paternity leave in a few weeks I can just sit down and finish this work out. ===== Extent refs ====== This is basically how extent refs work. You have either key.objectid = bytenr; key.type = BTRFS_EXTENT_ITEM_KEY; key.offset = length; or you have key.objectid = bytenr; key.type = BTRFS_METADATA_ITEM_KEY; key.offset = level of the metadata block; in the case of skinny metadata. Then you have the extent item which describes the number of refs and such, followed by 1 or more inline refs. All you need to know for this problem is how I'm going to describe them. What I call a "normal ref" or a "full ref" is a reference that has the actual root information in the ref. What I call a "shared ref" is one where we only know the tree block that owns the particular ref. So how does this work in practice? 1) Normal allocation - metadata: We allocate a tree block as we add new items to a tree. We know that this root owns this tree block so we create a normal ref with the root objectid in the extent ref. We also set the owner of the block itself to our objectid. This is important to keep in mind. 2) Normal allocaiton - data: We allocate some data for a given fs tree and we add a extent ref with the root objectid of the tree we are in, the inode number and the logical offset into the inode for this inode. 3) Splitting a data extent: We write to the middle of an existing extent. We will split this extent into two BTRFS_EXTENT_DATA_KEY items and the increase the ref count of the original extent by 1. This means we look up the extent ref for root->objectid, inode number and the _original_ inode offset. We don't create another extent ref, this is important to keep in mind. ===== btrfs_copy_root/update_ref_for_cow/btrfs_inc_ref/btrfs_dec_ref ==== But Josef, didn't you say there were shared refs? Why yes I did, but I need to explain it in context of the people who actually do the dirty work. We'll start with the easy case 1) btrfs_copy_root - where snapshots start: When we make a snapshot we call this function, which allocates a completely new block with a new root objectid and then memcpy's the original root we are snapshotting. Then we call btrfs_inc_ref on our new buffer, which will walk all items in that buffer and add a new normal ref to each of those blocks for our new root. This is only at the level below the new root, nothing below that point. 2) btrfs_inc_ref/btrfs_dec_ref - how we deal with snapshots: These guys are responsible for dealing with the particular action we want to make on our given buffer. So if we are free'ing our buffer, we need to drop any refs it has to the blocks it points to. For level > 0 this means modifying refs for all of the tree blocks it points to. For level == 0 this means modifying refs for any data extents the leaf may point to. 3) update_ref_for_cow - this is where the magic happens: This has a few different modes of operation, but every operation means we check to see if the block is shared, which is we see if we have been snapshotted and if we have been see if this block has changed since we snapshotted. If it is shared then we look up the extent refs and the flags. If not then we carry on. From here we have a few options. 3a) Not shared: Don't do anything, we can do our normal cow operations and carry on. 3b) Shared and cowing from the owning root: This is where the btrfs_header_owner() is important. If we owned this block and it is shared then we know that any of the upper levels won't have a normal ref to anything underneath this block, so we need to add a shared ref for anything this block points to. So the first thing we do is btrfs_inc_ref(), but we set the full backref flag. This means that when we add refs for everything this block points to we don't use a root objectid, we use the bytenr of this block. Then we set BTRFS_BLOCK_FLAG_FULL_BACKREF for the extent flags for this give block. 3c) Shared and cowing from not the owning root: So if we are cowing down from the snapshot we need to make sure that any block we own completely ourselves has normal refs for any blocks it points to. So we cow down and hit a shared block that we aren't the owner of, so we do btrfs_inc_ref() for our block and don't set full_backref so we get a normal ref on everything this block points to with our actual root objectid. 3d) Last ref but has FULL_BACKREF set: The last time we cow a block that was converted to a full backref from 3b. This is where we call btrfs_inc_ref() without full_backref set so we get our normal root objectid ref added, and then we call btrfs_dec_ref() with full_backref set so that we remove the shared ref for this block since it is about to be deleted. ===== Delayed refs ===== In order to make sure we don't recursively update the extent tree for every modification as we cow down a tree we have delayed refs. Every time we modify the ref count of an extent (allocate, inc ref, free) we allocate a delayed ref with the information about that ref. So every call to btrfs_inc_ref, btrfs_dec_ref, btrfs_allocate_tree_bloc, btrfs_alloc_reserved_file_extent, btrfs_free_extent all ends up allocating a delayed ref. The nice thing about this is it can keep us from actually modifying the extent tree. If we cow a block to free items and then subsequently free that block we can avoid touching the extent tree altogether for this. PROBLEM: For qgroups we needed a way to have a coherent view of the extent tree at a given time, which is where our sequence numbers come in. Each delayed ref new gets a sequence number, and a different sequence number means we can't do this in-memory merging right away like we used to. Now instead we have btrfs_merge_delayed_refs, which works pretty well, but is O(n) for the number of delayed refs we have for a given extent. Usually this is a small number, but there is one key area this isn't true, which is snapshot aware defrag. ====== Snapshot aware defrag ==== This has a couple of problems with it, but the problem that relates to this email is that we can have many many updates for one extent. So remember how we will add an extent ref every time we split an extent? Imagine writing a 128mb extent and then overwriting every other block. That is going to end up with 32768 file extents all pointing to the same extent. Now snapshot this fs 5 thousand times, now you have 163,840,000 extents you are going to be updating when you defrag this range. How this then ends up is you have that many delayed refs outstanding at the end of your defrag operation, that is if you are lucky enough to not OOM. This is because of the sequence numbers, we can't merge them until we actually go to run them. So now you've made it to actually calling btrfs_merge_delayed_refs, and then you wonder why your box just livelocked for 10 seconds for no apparent reason. With out the sequence numbers we would have merged each of these operations as we added them, so when we finished the defrag operation we'd have 5000 delayed refs each with a -32768 count for the freeing of the fragmented extent. Now obviously 1 part of this fix is to do one root at a time so we aren't queueing up N number of snapshots worth of delayed refs, but that still leaves the actual number of extents we have, which is why we need to kill/mitigate the pain of the sequence numbers. ======= Qgroups ====== I'm starting over with the qgroup accounting because it simply needs to operate without as much dependance on the sequence numbers. So here is how it is going to work 1) Create a delayed ref sort of scheme for quota operations. This lets us process things in bulk without having to worry about weird locking and such. 2) Only create these quota operations when we modify the actual refs. Part of the problem with the current work is that it tries to figure out what quota stuff is going to be used based on delayed refs. These updates could end up being chucked, which will screw up our counts. So instead only add a qgroup action when we add the first ref for that root objectid or remove the last ref for that root objectid. This will mean adding stuff to the actual ref modifying stuff in extent-tree.c. I have this part pretty well nailed down in my qgroup-work branch in btrfs-next. 3) Create qgroup operations in update_ref_for_cow. One of the tricky parts of killing sequence numbers is that we can completely lose extent ref operations that we really do care about. So say we snapshot, which means we convert all of the referenced space in the original root to not exclusive and add it to the referenced count of the snapshot. If we rm -rf on that snapshot we will cow down only to free the leaf, which means that the delayed ref will get dropped. This will cause us to lose the qgroup action because we only account for real ref modificatoins to the tree, and so we won't convert this space back to exclusive space for the original tree. So what we need to do here is add a qgroup operation saying that we are converting shared space to exclusive space with the cow and then set a flag on the extent buffer so we know that we have an outstanding convert operation. When this convert operation runs we simply look the extent buffer up and clear the flag. If we try to free the tree block before then we make sure to add a real qgroup operation. 4) The tricky part is data because we have implied refs, that is refs to data with no real way to know other than to lookup all of it's parents. We will still need sequence numbers in a way for this case, but way differently than what we have now. We only care about knowing about a given point in time previous to this special case. The special case is when we remove what looks like the last ref for this extent for this root but there are still outstanding refs on the extent. This case we need to lookup the backrefs to make sure there isn't an implied ref from our root to this data block. This is the point at which we increment the sequence number. So what happens is you have a whole bunch of delayed refs with the same sequence number, and then we hit this case and inc the sequence number, and you have a whole bunch of delayed refs with the new sequence number. You still have to post-process merge to make sure, but you are far more likely to merge everything in real-time since you are only changing the sequence number every once and a while instead of for every new sequence number. Then you lookup the backrefs for everything that happened up until but not after your sequence number to see if you had an implied ref for the data. ==== Conclusion ==== I realize this is a lot of information, but I'm not the only one working in this area at the moment so I thought it was important to get it all on paper so we're all on the same page. Please let me know if you have questions or suggestions, Like I said I'm on paternity leave so theres going to be a lag time for my responses. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html