I recall there being a thread here a number of months back regarding
data-deduplication support for btrfs.

Did anyone end up picking that up and giving a go at it? Block-level
data dedup would be *awesome* in a Linux filesystem. It does wonders
for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
have this feature yet (although I've read discussions on them looking
to add it).

Thanks for everyone's hard work!

Ray

--
Ray Van Dolson <rayvd@bludgeon.org>
GPG Fingerprint: 175B D779 4BC9 D5FF 5CC9 CE79 BCB4 0703 B51E 9F1A
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Ray Van Dolson <rayvd@bludgeon.org> writes:
> I recall there being a thread here a number of months back regarding
> data-deduplication support for btrfs.
>
> Did anyone end up picking that up and giving a go at it? Block level
> data dedup would be *awesome* in a Linux filesystem. It does wonders
> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> have this feature yet (although I've read discussions on them looking
> to add it).

There are some patches to do this in QEMU's cow format for KVM. That's
user level only.

-Andi

--
ak@linux.intel.com

On Sat, 2008-10-11 at 19:06 -0700, Ray Van Dolson wrote:
> I recall there being a thread here a number of months back regarding
> data-deduplication support for btrfs.
>
> Did anyone end up picking that up and giving a go at it? Block level
> data dedup would be *awesome* in a Linux filesystem. It does wonders
> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
> have this feature yet (although I've read discussions on them looking
> to add it).

So far nobody has grabbed this one, but I've had more requests (no
shocker there, the kvm people are interested in it too). It probably
won't make 1.0 but the disk format will be able to support it.

-chris

Andi Kleen wrote:
> Ray Van Dolson <rayvd@bludgeon.org> writes:
>
>> I recall there being a thread here a number of months back regarding
>> data-deduplication support for btrfs.
>>
>> Did anyone end up picking that up and giving a go at it? Block level
>> data dedup would be *awesome* in a Linux filesystem. It does wonders
>> for storing virtual machines w/ NetApp and WAFL, and even ZFS doesn't
>> have this feature yet (although I've read discussions on them looking
>> to add it).
>
> There are some patches to do in QEMU's cow format for KVM. That's
> user level only.

And thus, doesn't work for sharing between different images, especially
at runtime.

I'd really, really [any number of reallies], really like to see btrfs
deduplication.

--
error compiling committee.c: too many arguments to function

On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> > There are some patches to do in QEMU's cow format for KVM. That's
> > user level only.
>
> And thus, doesn't work for sharing between different images, especially
> at runtime.

It would work if the images are all based on a single reference image,
wouldn't it? I would imagine that's the common situation for installing
lots of VMs.

> I'd really, really [any number of reallies], really like to
> see btrfs deduplication.

Sure, it would be useful for a couple of things.

-Andi

--
ak@linux.intel.com

On Wed, Oct 15, 2008 at 3:15 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Wed, Oct 15, 2008 at 03:39:16PM +0200, Avi Kivity wrote:
>> And thus, doesn't work for sharing between different images, especially
>> at runtime.
>
> It would work if the images are all based once on a reference image, won't it?
> I would imagine that's the common situation for installing lots of VMs.

Like using bcp (the btrfs-specific cp) to create "new" images from a
base one? Would that suffice? With modifications after that being COW,
that could be a simple "stupid/hack" way of getting no duplication.

--
Miguel Sousa Filipe

> Like, using bcp (btrfs specific cp) for creating "new" images from a base one?
> Will that suffice?
> With modifications after that being COW, that could be a simple way of
> having a "stupid/hack" no duplication.

qcow already supports that. The challenge is just to deduplicate later,
e.g. when you start applying security updates to the images.

-Andi

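To make the "deduplicate later" case concrete, here is a minimal sketch of block-level dedup by content hash. Everything in it is an assumption made for illustration: the `BlockStore` class, the 4 KiB block size, and the choice of SHA-256 are hypothetical and do not describe how qcow or btrfs actually work. It only shows that two images which diverge from a base and then receive the same update end up sharing blocks again once blocks are indexed by content.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


class BlockStore:
    """Toy content-addressed store: identical blocks are kept only once."""

    def __init__(self):
        self.blocks = {}    # digest -> block contents
        self.refcount = {}  # digest -> number of references to that block

    def write_image(self, data):
        """Store an image and return the list of block digests it maps to."""
        digests = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:
                self.blocks[digest] = block  # first copy: actually store it
            self.refcount[digest] = self.refcount.get(digest, 0) + 1
            digests.append(digest)
        return digests


store = BlockStore()
unchanged = b"A" * BLOCK_SIZE   # block both guests still share with the base
update = b"C" * BLOCK_SIZE      # same "security update" applied to each guest

guest1 = store.write_image(unchanged + update)
guest2 = store.write_image(unchanged + update)
stored_blocks = len(store.blocks)   # four logical blocks, fewer stored
```

A real implementation would presumably also compare block contents byte for byte before sharing, rather than trusting the hash alone.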
Andi Kleen wrote:
>>> There are some patches to do in QEMU's cow format for KVM. That's
>>> user level only.
>>
>> And thus, doesn't work for sharing between different images, especially
>> at runtime.
>
> It would work if the images are all based once on a reference image, won't it?

Yes and no. It's difficult, almost impossible, to do at runtime, and it
allows one qemu to access another guest's data (read-only).

--
error compiling committee.c: too many arguments to function

On Mon, Oct 13, 2008 at 07:02:14AM -0400, Chris Mason wrote:
> So far nobody has grabbed this one, but I've had more requests (no
> shocker there, the kvm people are interested in it too). It probably
> won't make 1.0 but the disk format will be able to support it.

Both deduplication and compression have an interesting side effect in
which a write to a previously "allocated" block can return ENOSPC.
This is even more exciting when you factor in mmap. Any thoughts on
how to handle this?

-VAL

On Thu, 2008-10-16 at 15:25 -0400, Valerie Aurora Henson wrote:
> Both deduplication and compression have an interesting side effect in
> which a write to a previously "allocated" block can return ENOSPC.
> This is even more exciting when you factor in mmap. Any thoughts on
> how to handle this?

Unfortunately we'll have a number of places where ENOSPC will jump in
where people don't expect it, and this includes any COW overwrite of an
existing extent. The old extent isn't freed until snapshot deletion
time, which won't happen until after the current transaction commits.

Another example is fallocate. The extent will have a little flag that
says "I'm a preallocated extent", which is how we'll know we're allowed
to overwrite it directly instead of doing COW.

But, to write to the fallocated extent, we'll have to clear the flag.
So, we'll have to cow the block that holds the file extent pointer,
which means we can enospc.

-chris

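The COW overwrite case can be sketched with a toy model. This is a simplification under stated assumptions, not btrfs code: the `ToyCowFs` class and its block counters are hypothetical, and old copies are modeled as being released at commit, whereas the message above says they are freed at snapshot deletion time after the commit. The point is only that an overwrite needs a fresh block before the old one comes back, so it can fail even though the file is not growing.

```python
import errno


class ToyCowFs:
    """Hypothetical block accounting for a copy-on-write filesystem."""

    def __init__(self, total_blocks):
        self.free = total_blocks
        self.pinned = 0  # old copies that cannot be reused until commit

    def _alloc(self):
        if self.free == 0:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.free -= 1

    def write_new(self):
        """Append a brand-new block to a file."""
        self._alloc()

    def overwrite(self):
        """COW overwrite: allocate the new copy, keep the old one pinned."""
        self._alloc()
        self.pinned += 1

    def commit(self):
        """After the transaction commits, the old copies become free."""
        self.free += self.pinned
        self.pinned = 0


fs = ToyCowFs(total_blocks=4)
for _ in range(3):
    fs.write_new()        # file now uses 3 of 4 blocks
fs.overwrite()            # takes the last free block; old copy is pinned

try:
    fs.overwrite()        # same file size, but no free block for the copy
    failed_before_commit = False
except OSError as exc:
    failed_before_commit = exc.errno == errno.ENOSPC

fs.commit()               # the pinned block is released
fs.overwrite()            # the identical overwrite now succeeds
```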
On Thu, Oct 16, 2008 at 03:30:49PM -0400, Chris Mason wrote:
> Unfortunately we'll have a number of places where ENOSPC will jump in
> where people don't expect it, and this includes any COW overwrite of an
> existing extent. The old extent isn't freed until snapshot deletion
> time, which won't happen until after the current transaction commits.
>
> Another example is fallocate. The extent will have a little flag that
> says I'm a preallocated extent, which is how we'll know we're allowed to
> overwrite it directly instead of doing COW.
>
> But, to write to the fallocated extent, we'll have to clear the flag.
> So, we'll have to cow the block that holds the file extent pointer,
> which means we can enospc.

I'm sure you know this, but for the peanut gallery: You can avoid some
of these sorts of purely copy-on-write ENOSPC cases. Any operation
where the space used afterwards is less than or equal to the space
used before - like in your fallocate case - can avoid ENOSPC as long
as you reserve a certain amount of space on the fs and break down the
changes into small enough groups. Most file systems don't let you
fill up beyond 90-95% anyway because performance goes to hell. You
also need to do this so you can delete when your file system is full.

In general, it'd be nice to say that if your app can't handle surprise
ENOSPC, then if you run without snapshots, compression, or data dedup,
we guarantee you'll only get ENOSPC in the "normal" cases. What do
you think?

-VAL

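The reserve idea above can be sketched as follows. This is a minimal illustration under assumptions, not a proposed design: `ReservingAllocator`, `cow_rewrite`, and the numbers are all hypothetical. Growing writes must leave the reserve untouched, while operations that end up using no more space than before may borrow from it, provided they are split into groups no larger than the reserve.

```python
import errno


class ReservingAllocator:
    """Hypothetical allocator that holds back `reserve` blocks."""

    def __init__(self, total, reserve):
        self.free = total
        self.reserve = reserve

    def alloc(self, n, may_use_reserve=False):
        # Ordinary (growing) allocations must leave the reserve intact.
        floor = 0 if may_use_reserve else self.reserve
        if self.free - n < floor:
            raise OSError(errno.ENOSPC, "No space left on device")
        self.free -= n

    def release(self, n):
        self.free += n


def cow_rewrite(allocator, nblocks, group_size):
    """Rewrite nblocks via COW in groups, borrowing from the reserve."""
    done = 0
    while done < nblocks:
        n = min(group_size, nblocks - done)
        allocator.alloc(n, may_use_reserve=True)  # new copies of this group
        allocator.release(n)                      # old copies, freed once the group commits
        done += n


allocator = ReservingAllocator(total=100, reserve=5)
allocator.alloc(95)                 # fill the fs up to the reserve

try:
    allocator.alloc(1)              # a growing write is refused
    growing_write_failed = False
except OSError as exc:
    growing_write_failed = exc.errno == errno.ENOSPC

cow_rewrite(allocator, nblocks=20, group_size=5)   # fits inside the reserve

try:
    cow_rewrite(allocator, nblocks=20, group_size=6)  # groups too large
    big_groups_failed = False
except OSError as exc:
    big_groups_failed = exc.errno == errno.ENOSPC
```

The last call fails only because each group is bigger than the reserve, which is why the changes have to be broken into small enough groups.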
On Thu, Oct 16, 2008 at 03:25:01PM -0400, Valerie Aurora Henson wrote:
> Both deduplication and compression have an interesting side effect in
> which a write to a previously "allocated" block can return ENOSPC.
> This is even more exciting when you factor in mmap. Any thoughts on
> how to handle this?

Note that this can already happen in today's filesystems. Writing into
some preallocated space can always cause splits of the allocation or
bmap btrees, as the previous big preallocated extent is now split into
one allocated and at least one (or two, if writing into the middle)
preallocated extents.

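The extent split described above can be sketched like this. The `Extent` tuple and the `write_into_prealloc` helper are hypothetical names used only for illustration; no real filesystem's extent format is implied. One preallocated record turns into two or three records, and those extra records are what can force the btree to grow.

```python
from collections import namedtuple

# Hypothetical extent record: start block, length in blocks, prealloc flag.
Extent = namedtuple("Extent", "start length prealloc")


def write_into_prealloc(extent, start, length):
    """Split a preallocated extent around the written range."""
    end = start + length
    extent_end = extent.start + extent.length
    if not extent.prealloc:
        raise ValueError("extent is not preallocated")
    if length <= 0 or start < extent.start or end > extent_end:
        raise ValueError("write range lies outside the extent")

    pieces = []
    if start > extent.start:
        # Untouched head stays preallocated.
        pieces.append(Extent(extent.start, start - extent.start, True))
    # The written range becomes a normal allocated extent.
    pieces.append(Extent(start, length, False))
    if end < extent_end:
        # Untouched tail stays preallocated.
        pieces.append(Extent(end, extent_end - end, True))
    return pieces


big = Extent(start=0, length=100, prealloc=True)
at_front = write_into_prealloc(big, 0, 10)     # allocated + one prealloc
in_middle = write_into_prealloc(big, 40, 10)   # prealloc + allocated + prealloc
```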
On Fri, 2008-10-17 at 14:24 -0400, Valerie Aurora Henson wrote:
> I'm sure you know this, but for the peanut gallery: You can avoid some
> of these sort of purely copy-on-write ENOSPC cases. Any operation
> where the space used afterwards is less than or equal to the space
> used before - like in your fallocate case - can avoid ENOSPC as long
> as you reserve a certain amount of space on the fs and break down the
> changes into small enough groups. Most file systems don't let you
> fill up beyond 90-95% anyway because performance goes to hell. You
> also need to do this so you can delete when your file system is full.
>
> In general, it'd be nice to say that if your app can't handle surprise
> ENOSPC, then if you run without snapshots, compression, or data dedup,
> we guarantee you'll only get ENOSPC in the "normal" cases. What do
> you think?

I think I'll have to come back to this after getting ENOSPC to work at
all ;) You're right that reserved space can do wonders to dig us out of
holes, it has to be reserved at a multiple of the number of procs that I
allow into the transaction.

I should be able to go into an emergency one-writer-at-a-time theme as
space gets really tight, but there are lots of missing pieces that
haven't been coded yet in that area.

-chris

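The relationship between the reserve and the number of writers can be written down as simple arithmetic. This is only a sketch under assumptions: `writers_allowed` and its parameters are hypothetical, and nothing here reflects code that existed in btrfs at the time. If each process in a transaction needs its own slice of reserve, the free space caps how many processes can be let in, and the cap shrinks to one as space runs out.

```python
def writers_allowed(free_blocks, per_writer_reserve, max_writers):
    """Return how many writers may join the transaction.

    Each writer is assumed to need `per_writer_reserve` blocks held back
    for it. A result of 0 means not even one writer can be covered.
    """
    if per_writer_reserve <= 0:
        raise ValueError("per_writer_reserve must be positive")
    affordable = free_blocks // per_writer_reserve
    return max(0, min(max_writers, affordable))


plenty = writers_allowed(free_blocks=1000, per_writer_reserve=50, max_writers=8)
tight = writers_allowed(free_blocks=120, per_writer_reserve=50, max_writers=8)
emergency = writers_allowed(free_blocks=60, per_writer_reserve=50, max_writers=8)
```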
On Sun, Oct 19, 2008 at 08:16:31PM -0400, Chris Mason wrote:
> I think I'll have to come back to this after getting ENOSPC to work at
> all ;) You're right that reserved space can do wonders to dig us out of

:) Having been through this before, the ENOSPC accounting was incredibly
hard to get right. It's at least worth thinking about the edge cases
while you're writing the first version, although you will probably just
have to throw one away no matter what.

> holes, it has to be reserved at a multiple of the number of procs that I
> allow into the transaction.
>
> I should be able to go into an emergency one writer at a time theme as
> space gets really tight, but there are lots of missing pieces that
> haven't been coded yet in that area.

Makes sense.

I have the following "behave like I expect" rules for things that often
aren't right in the first version of a COW file system.

* If a write could succeed in the future without any user-level
  changes to the file system, then it will succeed the first time.

  Basically, this is reflecting what happens when space used by the
  previous version of the fs is freed after the next COW version is
  written out. A naive implementation of COW will fail the write if
  it happens while enough other writes are outstanding, even if there
  would be enough space after the other writes have been synced to
  disk and the blocks from the old version are freed. This means
  backing off to the one-writer-at-a-time mode you are talking about.

* Rewriting metadata will always succeed.

  Again, with naive COW, you can get into a state where doing a
  chmod() on a file could end up returning ENOSPC. Totally uncool.
  Pretty much just requires a little reserved space.

* Deletion will always succeed.

  Again, reserved space, plus a little forethought in metadata design.
  It is not automatically the case that your metadata will be designed
  such that deletion will always result in more free space afterwards,
  so it's worth a review pass just to be sure.

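The first rule can be sketched as "commit and retry before failing". The `BackoffCowFs` class below is a hypothetical toy, not an implementation from any filesystem: when an allocation fails, it waits for outstanding writes to commit (which frees the old copies) and tries once more, so ENOSPC is returned only when the space is truly not there.

```python
import errno


class BackoffCowFs:
    """Hypothetical COW block accounting with a commit-and-retry fallback."""

    def __init__(self, free_blocks):
        self.free = free_blocks
        self.pending_free = 0  # old copies released at the next commit

    def commit(self):
        self.free += self.pending_free
        self.pending_free = 0

    def _try_alloc(self, n):
        if n > self.free:
            return False
        self.free -= n
        return True

    def cow_overwrite(self, n):
        """Overwrite n blocks; fail only if a commit cannot make room."""
        if not self._try_alloc(n):
            # Back off: let outstanding writes reach disk, then retry once.
            self.commit()
            if not self._try_alloc(n):
                raise OSError(errno.ENOSPC, "No space left on device")
        self.pending_free += n  # the old copies become free after commit


fs = BackoffCowFs(free_blocks=2)
fs.cow_overwrite(2)      # uses both free blocks
fs.cow_overwrite(2)      # a naive COW would fail here; the retry succeeds

try:
    fs.cow_overwrite(3)  # needs more than can ever be freed
    genuinely_full = False
except OSError as exc:
    genuinely_full = exc.errno == errno.ENOSPC
```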
One thing I ran into before is that it's non-trivial to calculate
exactly how many blocks will need to be COW'd for even the tiniest
write. Leaves split, directories grow another block, the inode block
has to be copied, the tree grows another level, you have to allocate a
new free space extent, etc., etc. The worst case can be hundreds of
KB per 1-byte write. Logically, you may only be writing a few bytes,
but they may require megabytes of free space to sync out to disk.
Very annoying.

-VAL

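A back-of-the-envelope version of that worst case, under a deliberately simplified model: every node on the root-to-leaf path is copied, every one of them may also split into a new sibling, and a root split adds one more level. The function names, the three-tree example, and the 16 KiB block size are assumptions chosen for illustration, not figures from the thread or from any real reservation code.

```python
def worst_case_cow_blocks(tree_height, trees_touched=1):
    """Upper bound on blocks written for one small update.

    Per tree: `tree_height` copied nodes, up to `tree_height` new
    siblings from splits, plus one new root if the root itself splits.
    """
    if tree_height < 1 or trees_touched < 1:
        raise ValueError("tree_height and trees_touched must be at least 1")
    per_tree = 2 * tree_height + 1
    return per_tree * trees_touched


def worst_case_cow_bytes(tree_height, trees_touched, block_size):
    """Same bound expressed in bytes for a given block size."""
    return worst_case_cow_blocks(tree_height, trees_touched) * block_size


# Hypothetical example: three trees of height 3 with 16 KiB blocks.
blocks = worst_case_cow_blocks(tree_height=3, trees_touched=3)
kib = worst_case_cow_bytes(3, 3, 16 * 1024) // 1024
```

With these made-up parameters the bound is 21 blocks, or 336 KiB, for a write that is logically a single byte.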