Hello,

If I write a large sequential file on a snapshot, then create another
snapshot, overwrite the file with a small amount of data and delete the
first snapshot, the second snapshot has a very large data extent of which
only a small part is used. For example, if I use the following sequence:

  mkfs.btrfs /dev/sdn
  mount -o noatime,nodatacow,nospace_cache /dev/sdn /mnt/b
  btrfs sub snap /mnt/b /mnt/b/snap1
  dd if=/dev/zero of=/mnt/b/snap1/t count=15000 bs=65535
  sync
  btrfs sub snap /mnt/b/snap1 /mnt/b/snap2
  dd if=/dev/zero of=/mnt/b/snap2/t seek=3 count=1 bs=2048
  sync
  btrfs sub delete /mnt/b/snap1
  btrfs-debug-tree /dev/sdn

I see the following data extents:

  item 6 key (257 EXTENT_DATA 0) itemoff 3537 itemsize 53
          extent data disk byte 1103101952 nr 194641920
          extent data offset 0 nr 4096 ram 194641920
          extent compression 0
  item 7 key (257 EXTENT_DATA 4096) itemoff 3484 itemsize 53
          extent data disk byte 2086129664 nr 4096
          extent data offset 0 nr 4096 ram 4096
          extent compression 0

In item 6, only 4096 bytes of 194641920 are in use. The rest of the space
is wasted.

If I defragment with "btrfs filesystem defragment /mnt/b/snap2/t", it
releases the wasted space. But I can't use defragment, because if I have
several snapshots I need to run defragment on each snapshot, and that
breaks the relation between snapshots and creates multiple copies of the
same data.

In our test, which creates and deletes snapshots while writing data, we
end up with a few GBs of disk space wasted.

Is it possible to limit the size of allocated data extents?
Is it possible to defragment a subvolume without breaking snapshot relations?
Any other idea how to recover the wasted space?

Thanks,
Moshe Melnikov
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
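For reference, the waste implied by item 6 above can be computed directly
from the two `nr` fields: the first (on the "disk byte" line) is the size of
the on-disk extent, the second (on the "offset" line) is the portion this
file extent item actually references. A quick sketch:

```python
# Wasted space implied by item 6 in the btrfs-debug-tree output above:
# the on-disk extent is 194641920 bytes ("disk byte ... nr 194641920"),
# but the file extent item references only 4096 of them ("offset 0 nr 4096").

disk_extent_bytes = 194641920
referenced_bytes = 4096

wasted = disk_extent_bytes - referenced_bytes
print(wasted)                      # 194637824 bytes
print(round(wasted / 2**20, 1))    # 185.6 MiB tied up by a 4k reference
```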
On Mon, Feb 04, 2013 at 02:08:01AM -0700, Moshe wrote:
> [...]
> Is it possible to limit size of allocated data extents?
> Is it possible to defragment subvolume without breaking snapshots relations?
> Any other idea how to recover wasted space?

This is all by design, to try and limit the size of the extent tree.
Instead of splitting references in the extent tree to account for the
split extent, we do it in the file tree. In your case it results in a lot
of wasted space.
This is on the list of things to fix; we will just split the references in
the extent tree and deal with the larger extent tree, but it's on the back
burner while we get things a bit more stable. Thanks,

Josef
Thanks for your reply Josef.
I want to experiment with extent sizes, to see how they influence the size
of the extent tree. Can you point me to code that I can change to limit
the size of data extents?

Thanks,
Moshe Melnikov

-----Original Message-----
From: Josef Bacik
Sent: Monday, February 04, 2013 5:56 PM
To: Moshe
Cc: linux-btrfs@vger.kernel.org
Subject: Re: btrfs wastes disk space after snapshot deletion.

[...]

This is all by design to try and limit the size of the extent tree.
Instead of splitting references in the extent tree to account for the
split extent we do it in the file tree. In your case it results in a lot
of wasted space. This is on the list of things to fix, we will just split
the references in the extent tree and deal with the larger extent tree,
but it's on the back burner while we get things a bit more stable.

Thanks,

Josef
On Tue, Feb 05, 2013 at 02:09:02AM -0700, Moshe wrote:
> Thanks for your reply Josef.
> I want to experiment with extent sizes, to see how they influence the
> size of the extent tree. Can you point me to code that I can change to
> limit the size of data extents?

So it's not the size of the data extents, it's how we deal with references
to them. Let me map out what happens now:

1) we do a write and create a 1 gig data extent.
2) create a file extent item in the fs tree pointing to the extent
3) create a reference with a count of 1 for the entire extent
4) create a snapshot of the data extent
5) write 4k to the middle of the extent
6a) we cow down to the file extent item we need to split and add a ref to
    the original 1 gig extent because of the snapshot.
6b) split the file extent item in the fs tree into 3 extents:
    - one from 0 to the random offset
    - one from random offset to random offset + 4k
    - one from random offset + 4k to the end of the original extent;
      this points to an offset within the original 1 gig extent
6c) in the split we increase the refcount of the original 1 gig extent by 1
7) add an extent reference for the 4k extent we wrote.

So at the end of this our original 1 gig extent has 3 references: 1 for the
original snapshot with its unmodified extent, 2 for the snapshot, which
includes a reference for each chunk of the split extent. In order to free
up this space you would have to overwrite the entirety of the remaining
chunks of the original extent in the snapshot and free up the extent in
the original fs by some means.

So say you delete the file in the original file system, and then do
something horrible like overwrite every other 4k block in the file: you'd
end up with around 1.5 gig of data in use for logically 1 gig of actual
space. The way to fix this is in 6c.

In file.c you have __btrfs_drop_extents, which does this
btrfs_inc_extent_ref on an extent it has to split on two sides. Instead of
doing this we would probably add another delayed extent operation for
splitting the extent reference. So instead of having file extents that
span large areas and stick around forever, we just fix the extent
references to account for the actual file extents, so when you drop a part
you actually recover the space. There is no code for this yet because this
is kind of an overhaul of how things are done, and I'm still getting "if I
do blah it panics the box" emails, so I want to spend time stabilizing. If
this is something you want to tackle go for it, but be prepared to spend a
few months on it. Thanks,

Josef
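To make that "around 1.5 gig for 1 gig of data" figure concrete, here is a
toy model of the accounting described above (plain Python, not btrfs code;
all names are made up). It models only the one rule that matters here: an
on-disk extent stays fully allocated as long as any file extent item still
references any part of it.

```python
# Toy model (NOT btrfs code) of the "overwrite every other 4k block"
# scenario: the original 1 gig extent stays fully allocated as long as
# a single 4k block of the snapshot's file still points into it.

GIB = 1 << 30
K4 = 4096
nblocks = GIB // K4

# Start: a 1 gig extent, fully referenced by the file (steps 1-4 above),
# and the copy in the original fs already deleted.
refs_into_orig = nblocks      # 4k blocks of the file still pointing into it
new_extent_bytes = 0          # space taken by freshly written 4k extents

# Overwrite every other 4k block in the snapshot (steps 5-7, repeated).
for block in range(0, nblocks, 2):
    refs_into_orig -= 1       # this block now points at its own 4k extent
    new_extent_bytes += K4

# The original extent is freed only when its last reference goes away.
orig_alloc = GIB if refs_into_orig > 0 else 0
allocated = orig_alloc + new_extent_bytes

print(allocated / GIB)        # 1.5 -- "around 1.5 gig" for 1 gig of data
```

Half of the blocks still reference the original extent, so it never gets
freed, and the 0.5 gig of new 4k extents comes on top of it.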
Moshe Melnikov
2013-Feb-05 15:27 UTC
Re: btrfs wastes disk space after snapshot deletion.
Is it possible in step 1) to create a few smaller extents instead of a
1 gig data extent?

Moshe

-----Original Message-----
From: Josef Bacik
Sent: Tuesday, February 05, 2013 4:41 PM
To: Moshe
Cc: Josef Bacik; linux-btrfs@vger.kernel.org
Subject: Re: btrfs wastes disk space after snapshot deletion.

[...]
On Tue, Feb 05, 2013 at 05:27:45PM +0200, Moshe Melnikov wrote:
> Is it possible in step 1) to create a few smaller extents instead of a
> 1 gig data extent?

DIO or O_SYNC can help to create extents whose size is your 'bs=xxx', but
you know, this is not expected to be as fast as buffered write.

thanks,
liubo
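A sketch of the write pattern liubo suggests: open with O_SYNC and write in
fixed-size chunks, so each chunk is flushed on its own and tends to get its
own, bounded data extent. Whether btrfs actually bounds extent sizes this
way depends on the kernel version and mount options; the snippet below only
demonstrates the I/O pattern itself, with made-up sizes and paths.

```python
import os
import tempfile

# O_SYNC writes in fixed-size chunks: each write is flushed separately,
# which is the pattern that tends to produce extents of roughly CHUNK
# bytes instead of one large allocation. Filesystem behavior may vary.

CHUNK = 1 << 20               # 1 MiB per write, i.e. dd's bs=1M
path = os.path.join(tempfile.mkdtemp(), "t")

fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_SYNC, 0o644)
try:
    buf = b"\0" * CHUNK
    for _ in range(4):        # 4 MiB written as 1 MiB synchronous chunks
        os.write(fd, buf)
finally:
    os.close(fd)

print(os.path.getsize(path))  # 4194304
```

The dd equivalents would be oflag=sync (O_SYNC) or oflag=direct (DIO)
combined with a bs= of the desired extent size, at the cost of slower
writes than the buffered path.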
On Mon, Feb 04, 2013 at 11:08:01AM +0200, Moshe wrote:
> If I defragment like: btrfs filesystem defragment /mnt/b/snap2/t it
> release wasted space. But I can't use defragment because if I have
> few snapshots I need to run defragment on each snapshot and it
> disconnect relation between snapshot and create multiple copies of
> same data.

Well, just for this case, you can try our experimental feature,
'snapshot-aware defrag', which is designed for exactly this kind of
problem.

It's still floating on the ML, and I've no idea when it'll land in
upstream. Currently the latest patch is V6, and NOTE: if you want to use
autodefrag (which is recommended), you should apply the v6 patch along
with another patch for autodefrag, otherwise it may crash your box.

FYI,
- snapshot-aware defrag
  https://patchwork.kernel.org/patch/2058911/
- autodefrag fix
  https://patchwork.kernel.org/patch/2058921/

thanks,
liubo