Steve Radich, BitShop, Inc.
2009-Dec-13 20:51 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
I enabled compression on a zfs filesystem with compression=gzip-9 - i.e. fairly slow compression - this stores backups of databases (which compress fairly well).

The next question is: is the checksum on disk based on the uncompressed data (which seems more likely to be recoverable) or on the compressed data (which seems slightly less likely to be recoverable)?

Why? Because if you can de-dup anyway, why bother to compress THEN check? This SEEMS to be the behaviour - i.e. I suspect many of the files I'm writing are dups, yet I see high CPU use even though on some of the copies I see almost no disk writes.

If the dedup check happened first AND the block is a duplicate, I should see hardly any CPU use (because the data wouldn't need to be compressed).

Steve Radich
BitShop.com
Robert Milkowski
2009-Dec-14 09:02 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
> I enabled compression on a zfs filesystem with compression=gzip-9 - i.e. fairly slow compression - this stores backups of databases (which compress fairly well).
>
> The next question is: is the checksum on disk based on the uncompressed data (which seems more likely to be recoverable) or on the compressed data (which seems slightly less likely to be recoverable)?
>
> Why? Because if you can de-dup anyway, why bother to compress THEN check? This SEEMS to be the behaviour - i.e. I suspect many of the files I'm writing are dups, yet I see high CPU use even though on some of the copies I see almost no disk writes.
>
> If the dedup check happened first AND the block is a duplicate, I should see hardly any CPU use (because the data wouldn't need to be compressed).

First, the checksum is calculated after compression happens. If both compression and dedup are enabled for a given dataset, zfs will first compress the data, calculate the checksum, and then dedup it.

That makes perfect sense: if your data is very compressible and the unique set is large enough that compression pays off, it makes sense to use both features together. If you don't want compression while using dedup, just disable it.

--
Robert Milkowski
http://milek.blogspot.com
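To make that ordering concrete, here is a minimal, hypothetical sketch of the write path in Python (not ZFS code): the record is compressed first, the checksum is taken over the compressed block, and that same checksum is then used as the dedup key. zlib stands in for lzjb/gzip, and plain dicts stand in for the dedup table and the pool.

    import hashlib
    import zlib

    # In-memory stand-ins: the dedup table maps checksum -> block address,
    # and "storage" maps address -> compressed bytes.
    dedup_table = {}
    storage = {}

    def write_block(record: bytes) -> str:
        """Sketch of the order described above: compress, checksum, then dedup."""
        compressed = zlib.compress(record)          # stand-in for lzjb/gzip
        checksum = hashlib.sha256(compressed).hexdigest()

        # The dedup lookup uses the checksum of the *compressed* block,
        # so the compression cost is paid even when the block is a duplicate.
        if checksum in dedup_table:
            return dedup_table[checksum]            # reference the existing block

        address = f"dva-{len(storage)}"             # pretend disk address
        storage[address] = compressed
        dedup_table[checksum] = address
        return address

    # Writing the same record twice compresses it twice but stores it once.
    a = write_block(b"database backup page" * 1000)
    b = write_block(b"database backup page" * 1000)
    assert a == b and len(storage) == 1

Because the dedup key is derived from the compressed block, the compression work happens even when the block turns out to be a duplicate and no new data is written - which matches the high-CPU, low-write behaviour Steve describes.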
On Sun, Dec 13, 2009 at 11:51 PM, Steve Radich, BitShop, Inc. <stever at bitshop.com> wrote:
> I enabled compression on a zfs filesystem with compression=gzip-9 - i.e. fairly slow compression - this stores backups of databases (which compress fairly well).
>
> The next question is: is the checksum on disk based on the uncompressed data (which seems more likely to be recoverable) or on the compressed data (which seems slightly less likely to be recoverable)?
>
> Why? Because if you can de-dup anyway, why bother to compress THEN check? This SEEMS to be the behaviour - i.e. I suspect many of the files I'm writing are dups, yet I see high CPU use even though on some of the copies I see almost no disk writes.
>
> If the dedup check happened first AND the block is a duplicate, I should see hardly any CPU use (because the data wouldn't need to be compressed).

ZFS deduplication is block-level, so to deduplicate you need the data broken into the blocks that will actually be written. With compression enabled, you don't have those blocks until the data has been compressed. It does look like wasted cycles, but ...

Regards,
Andrey
A Darren Dunham
2009-Dec-14 18:46 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
On Mon, Dec 14, 2009 at 09:30:29PM +0300, Andrey Kuzmin wrote:
> ZFS deduplication is block-level, so to deduplicate you need the data
> broken into the blocks that will actually be written. With compression
> enabled, you don't have those blocks until the data has been compressed.
> It does look like wasted cycles, but ...

ZFS compression is also block-level. Both are done on ZFS blocks; ZFS compression is not stream-wise.

--
Darren
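A quick illustration of that distinction (zlib is only a stand-in for lzjb/gzip): each record is compressed independently, so any single block can be decompressed - and therefore checksummed or dedup'd - on its own, whereas a stream compressor spans records.

    import zlib

    data = (b"A" * 131072, b"B" * 131072)   # two 128 KiB records

    # Block-level: each record is compressed on its own, as ZFS does.
    blocks = [zlib.compress(rec) for rec in data]
    assert all(zlib.decompress(b) == rec for b, rec in zip(blocks, data))

    # Stream-wise: one compressor spans both records, so the second record
    # cannot be recovered without decompressing everything before it.
    comp = zlib.compressobj()
    stream = comp.compress(data[0]) + comp.compress(data[1]) + comp.flush()
    assert zlib.decompress(stream) == data[0] + data[1]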
Casper.Dik at Sun.COM
2009-Dec-14 18:53 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
> On Mon, Dec 14, 2009 at 09:30:29PM +0300, Andrey Kuzmin wrote:
>> ZFS deduplication is block-level, so to deduplicate you need the data
>> broken into the blocks that will actually be written. With compression
>> enabled, you don't have those blocks until the data has been compressed.
>> It does look like wasted cycles, but ...
>
> ZFS compression is also block-level. Both are done on ZFS blocks; ZFS
> compression is not stream-wise.

And if you enable "verify" and you checksum the uncompressed data, you will need to uncompress before you can verify.

Casper
On Mon, Dec 14, 2009 at 9:53 PM, <Casper.Dik at sun.com> wrote:
>> ZFS compression is also block-level. Both are done on ZFS blocks; ZFS
>> compression is not stream-wise.
>
> And if you enable "verify" and you checksum the uncompressed data, you
> will need to uncompress before you can verify.

Right, but 'verify' seems to be an 'extreme safety' setting and thus a rather rare use case. Saving the cycles lost to compressing duplicates looks to outweigh the 'uncompress before verify' overhead, imo.

Regards,
Andrey
On Mon, Dec 14, 2009 at 9:32 PM, Andrey Kuzmin <andrey.v.kuzmin at gmail.com> wrote:
>
> Right, but 'verify' seems to be an 'extreme safety' setting and thus a
> rather rare use case.

Hmm, dunno. I wouldn't set anything but a scratch file system to dedup=on. Anything of even slight significance gets dedup=verify.

> Saving the cycles lost to compressing duplicates looks to outweigh the
> 'uncompress before verify' overhead, imo.

Dedup doesn't come for free - it imposes additional load on the CPU, just like checksumming and compression. The more fancy things we want our file system to do for us, the more CPU it'll take.

--
Regards,
Cyril
> Hmm, dunno. I wouldn't set anything but a scratch file system to
> dedup=on. Anything of even slight significance gets dedup=verify.

Why? Are you saying this because the ZFS dedup code is relatively new? Or because you think there's some other problem or disadvantage to it? We're planning on using deduplication for archiving old data, and I see good use cases for it with virtual machine data.

> Dedup doesn't come for free - it imposes additional load on the CPU, just
> like checksumming and compression. The more fancy things we want our
> file system to do for us, the more CPU it'll take.

Understood and agreed... but if you already have the extra CPU cycles, then, depending on the type of data and your deduplication ratios, it may be worth spending the extra CPU to avoid buying the disk.

-Nick
On 12/14/09, Cyril Plisko <cyril.plisko at mountall.com> wrote:
> On Mon, Dec 14, 2009 at 9:32 PM, Andrey Kuzmin
> <andrey.v.kuzmin at gmail.com> wrote:
>>
>> Right, but 'verify' seems to be an 'extreme safety' setting and thus a
>> rather rare use case.
>
> Hmm, dunno. I wouldn't set anything but a scratch file system to
> dedup=on. Anything of even slight significance gets dedup=verify.
>
>> Saving the cycles lost to compressing duplicates looks to outweigh the
>> 'uncompress before verify' overhead, imo.
>
> Dedup doesn't come for free - it imposes additional load on the CPU, just
> like checksumming and compression. The more fancy things we want our
> file system to do for us, the more CPU it'll take.

Verify mode actually looks compress/dedup order-neutral. To do the byte comparison, one can either compress the new block or decompress the old one, and the latter is usually a bit easier. Pipeline design may dictate the choice - for instance, one could compress the new block while the old one is being fetched from disk for comparison - but overall it looks pretty close. And with dedup=on, reversing the order, if feasible, saves quite a few cycles.

Regards,
Andrey
Darren J Moffat
2009-Dec-15 10:00 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Cyril Plisko wrote:
> On Mon, Dec 14, 2009 at 9:32 PM, Andrey Kuzmin
> <andrey.v.kuzmin at gmail.com> wrote:
>> Right, but 'verify' seems to be an 'extreme safety' setting and thus a
>> rather rare use case.
>
> Hmm, dunno. I wouldn't set anything but a scratch file system to
> dedup=on. Anything of even slight significance gets dedup=verify.

Why? Is it because you don't believe SHA256 (which is the default checksum used when dedup=on is specified) is strong enough?

--
Darren J Moffat
Kjetil Torgrim Homme
2009-Dec-15 12:06 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Robert Milkowski <milek at task.gda.pl> writes:
> On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
>> Because if you can de-dup anyway why bother to compress THEN check?
>> This SEEMS to be the behaviour - i.e. I would suspect many of the
>> files I'm writing are dups - however I see high cpu use even though
>> on some of the copies I see almost no disk writes.
>
> First, the checksum is calculated after compression happens.

For some reason I, like Steve, thought the checksum was calculated on the uncompressed data, but a look in the source confirms you're right, of course.

Thinking about the consequences of changing it: RAID-Z recovery would be much more CPU intensive if hashing were done on uncompressed data - every possible combination of the N-1 disks would have to be decompressed (and most combinations would fail), and *then* the remaining candidates would be hashed to see if the data is correct.

This would be done on a per-recordsize basis, not per stripe, which means reconstruction would fail if two disk blocks (512 octets) on different disks and in different stripes went bad. (Doing an exhaustive search over all possible permutations to handle that case doesn't seem realistic.)

In addition, hashing becomes slightly more expensive since more data needs to be hashed.

Overall, my guess is that this choice (made before dedup!) will give worse performance in normal situations in the future, when dedup+lzjb will be very common, in exchange for faster and more reliable resilver. In any case, there is not much to be done about it now.

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On Tue, Dec 15, 2009 at 3:06 PM, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> Robert Milkowski <milek at task.gda.pl> writes:
>> On 13/12/2009 20:51, Steve Radich, BitShop, Inc. wrote:
>>> Because if you can de-dup anyway why bother to compress THEN check?
>>> This SEEMS to be the behaviour - i.e. I would suspect many of the
>>> files I'm writing are dups - however I see high cpu use even though
>>> on some of the copies I see almost no disk writes.
>>
>> First, the checksum is calculated after compression happens.
>
> For some reason I, like Steve, thought the checksum was calculated on
> the uncompressed data, but a look in the source confirms you're right,
> of course.
>
> Thinking about the consequences of changing it: RAID-Z recovery would be
> much more CPU intensive if hashing were done on uncompressed data --

I don't quite see how dedup (based on sha256) and parity (based on crc32) are related.

Regards,
Andrey
Kjetil Torgrim Homme
2009-Dec-16 14:18 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
> Kjetil Torgrim Homme wrote:
>> For some reason I, like Steve, thought the checksum was calculated on
>> the uncompressed data, but a look in the source confirms you're right,
>> of course.
>>
>> Thinking about the consequences of changing it: RAID-Z recovery would be
>> much more CPU intensive if hashing were done on uncompressed data --
>
> I don't quite see how dedup (based on sha256) and parity (based on
> crc32) are related.

I tried to hint at an explanation:

>> every possible combination of the N-1 disks would have to be
>> decompressed (and most combinations would fail), and *then* the
>> remaining candidates would be hashed to see if the data is correct.

The key is that you don't know which block is corrupt. If everything is hunky-dory, the parity will match the data. Parity in RAID-Z1 is not a checksum like CRC32; it is simply XOR (as in RAID 5). Here's an example with four data disks and one parity disk:

  D1  D2  D3  D4  PP
  00  01  10  10  01

This is a single stripe with 2-bit disk blocks for simplicity. If you XOR together all the blocks, you get 00. That's the simple premise for reconstruction: D1 = XOR(D2, D3, D4, PP), D2 = XOR(D1, D3, D4, PP), and so on.

So what happens if a bit flips in D4 and it becomes 00? The total XOR isn't 00 anymore, it is 10 - something is wrong. But unless you get a hardware signal from D4, you don't know which block is corrupt. This is a major problem with RAID 5: the data is irrevocably corrupt. The parity discovers the error and can alert the user, but that's the best it can do. In RAID-Z the hash saves the day: first *assume* D1 is bad and reconstruct it from parity. If the hash for the block is then OK, D1 *was* bad. Otherwise, assume D2 is bad, and so on.

So the parity calculation indicates which stripes contain bad blocks, but the hash - the sanity check for which disk blocks are actually bad - must be calculated over all the stripes a ZFS block (record) consists of.

>> This would be done on a per-recordsize basis, not per stripe, which
>> means reconstruction would fail if two disk blocks (512 octets) on
>> different disks and in different stripes went bad. (Doing an exhaustive
>> search over all possible permutations to handle that case doesn't seem
>> realistic.)

Actually this is the same whether compression happens before or after hashing; it's just that each permutation is more expensive to check.

>> In addition, hashing becomes slightly more expensive since more data
>> needs to be hashed.
>>
>> Overall, my guess is that this choice (made before dedup!) will give
>> worse performance in normal situations in the future, when dedup+lzjb
>> will be very common, in exchange for faster and more reliable resilver.
>> In any case, there is not much to be done about it now.

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
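A toy model of the reconstruction described above (illustrative only, not ZFS code): the XOR parity shows that a stripe is inconsistent, but only the record checksum, tried against each "assume column i is bad" reconstruction, identifies which device actually held the bad data.

    import hashlib

    def xor(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    # Four data columns plus XOR parity, and a checksum over the whole record.
    data = [b"\x00", b"\x01", b"\x02", b"\x02"]
    parity = b"\x00"
    for col in data:
        parity = xor(parity, col)
    record_checksum = hashlib.sha256(b"".join(data)).digest()

    # Silently corrupt one column (no I/O error reported by the device).
    damaged = list(data)
    damaged[3] = b"\x00"

    # Try reconstructing each column in turn from the others plus parity;
    # the record checksum tells us which attempt produced the right data.
    for bad in range(len(damaged)):
        candidate = list(damaged)
        rebuilt = parity
        for i, col in enumerate(damaged):
            if i != bad:
                rebuilt = xor(rebuilt, col)
        candidate[bad] = rebuilt
        if hashlib.sha256(b"".join(candidate)).digest() == record_checksum:
            print(f"column {bad} was bad; reconstructed value {candidate[bad]!r}")
            break

If the checksum were computed over uncompressed data, every candidate in that loop would also have to be decompressed before it could be checked, which is the extra resilver cost being pointed at here.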
Yet again, I don't see how RAID-Z reconstruction is related to the subject discussed (whether raw or compressed data should be sha256'ed when both dedup and compression are enabled). sha256 has nothing to do with bad block detection (maybe it will when encryption is implemented, but for now sha256 is used only for duplicate-candidate look-up).

Regards,
Andrey

On Wed, Dec 16, 2009 at 5:18 PM, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> The key is that you don't know which block is corrupt. If everything is
> hunky-dory, the parity will match the data. Parity in RAID-Z1 is not a
> checksum like CRC32; it is simply XOR (as in RAID 5). [...]
Kjetil Torgrim Homme
2009-Dec-16 16:25 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
> Yet again, I don't see how RAID-Z reconstruction is related to the
> subject discussed (whether raw or compressed data should be sha256'ed
> when both dedup and compression are enabled). sha256 has nothing to do
> with bad block detection (maybe it will when encryption is implemented,
> but for now sha256 is used only for duplicate-candidate look-up).

How do you think RAID-Z resilvering works? Please correct me where I'm wrong.

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
>> Yet again, I don't see how RAID-Z reconstruction is related to the
>> subject discussed (whether raw or compressed data should be sha256'ed
>> when both dedup and compression are enabled). sha256 has nothing to do
>> with bad block detection (maybe it will when encryption is implemented,
>> but for now sha256 is used only for duplicate-candidate look-up).
>
> How do you think RAID-Z resilvering works? Please correct me where I'm
> wrong.

Resilvering has nothing to do with sha256: one could resilver long before dedup was introduced in zfs.

Regards,
Andrey
Darren J Moffat
2009-Dec-16 16:46 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin wrote:
> On Wed, Dec 16, 2009 at 7:25 PM, Kjetil Torgrim Homme
> <kjetilho at linpro.no> wrote:
>> Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
>>> Yet again, I don't see how RAID-Z reconstruction is related to the
>>> subject discussed (whether raw or compressed data should be sha256'ed
>>> when both dedup and compression are enabled). sha256 has nothing to do
>>> with bad block detection (maybe it will when encryption is implemented,
>>> but for now sha256 is used only for duplicate-candidate look-up).
>> How do you think RAID-Z resilvering works? Please correct me where I'm
>> wrong.
>
> Resilvering has nothing to do with sha256: one could resilver long
> before dedup was introduced in zfs.

SHA256 isn't just used for dedup; it has been available as one of the checksum algorithms right back to pool version 1, which integrated in build 27.

SHA256 is also used to checksum the pool uberblock.

This means that SHA256 is used during resilvering, and especially so if you have checksum=sha256 on your datasets.

If you still don't believe me, check the source code history:

http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio_checksum.c
http://src.opensolaris.org/source/history/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sha256.c

Look at the date when that integrated: 31st October 2005.

In case you still doubt me, look at the fix I just integrated today:

http://mail.opensolaris.org/pipermail/onnv-notify/2009-December/011090.html

--
Darren J Moffat
On Wed, Dec 16, 2009 at 7:46 PM, Darren J Moffat <darrenm at opensolaris.org> wrote:
> Andrey Kuzmin wrote:
>> Resilvering has nothing to do with sha256: one could resilver long
>> before dedup was introduced in zfs.
>
> SHA256 isn't just used for dedup; it has been available as one of the
> checksum algorithms right back to pool version 1, which integrated in
> build 27.

'One of' is the key phrase. And thanks for the code pointers, I'll take a look.

Regards,
Andrey
Kjetil Torgrim Homme
2009-Dec-17 00:33 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
> Darren J Moffat wrote:
>> Andrey Kuzmin wrote:
>>> Resilvering has nothing to do with sha256: one could resilver long
>>> before dedup was introduced in zfs.
>>
>> SHA256 isn't just used for dedup; it has been available as one of the
>> checksum algorithms right back to pool version 1, which integrated in
>> build 27.
>
> 'One of' is the key phrase. And thanks for the code pointers, I'll take
> a look.

I didn't mention sha256 at all :-). The reasoning is the same no matter which hash algorithm you're using (fletcher2, fletcher4 or sha256). Dedup doesn't require sha256 either; you can use fletcher4.

The question was: why does data have to be compressed before it can be recognised as a duplicate? It does seem like a waste of CPU, no? I attempted to show the downsides of identifying blocks by their uncompressed hash. (BTW, it doesn't affect storage efficiency; the same duplicate blocks will be discovered either way.)

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
The downside you have described happens only when the same checksum is used for data protection and for duplicate detection. This implies sha256, BTW, since fletcher-based dedup has been dropped in recent builds.

On 12/17/09, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> I didn't mention sha256 at all :-). The reasoning is the same no matter
> which hash algorithm you're using (fletcher2, fletcher4 or sha256). Dedup
> doesn't require sha256 either; you can use fletcher4.
>
> The question was: why does data have to be compressed before it can be
> recognised as a duplicate? It does seem like a waste of CPU, no? I
> attempted to show the downsides of identifying blocks by their
> uncompressed hash. (BTW, it doesn't affect storage efficiency; the same
> duplicate blocks will be discovered either way.)

--
Regards,
Andrey
Kjetil Torgrim Homme
2009-Dec-17 14:32 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:

> The downside you have described happens only when the same checksum is
> used for data protection and for duplicate detection. This implies
> sha256, BTW, since fletcher-based dedup has been dropped in recent
> builds.

If the hash used for dedup were completely separate from the hash used for data protection, I don't see any downsides to computing the dedup hash from uncompressed data. Why isn't it separate?

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
Darren J Moffat
2009-Dec-17 14:45 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Kjetil Torgrim Homme wrote:
> Andrey Kuzmin <andrey.v.kuzmin at gmail.com> writes:
>
>> The downside you have described happens only when the same checksum is
>> used for data protection and for duplicate detection. This implies
>> sha256, BTW, since fletcher-based dedup has been dropped in recent
>> builds.
>
> If the hash used for dedup were completely separate from the hash used
> for data protection, I don't see any downsides to computing the dedup
> hash from uncompressed data. Why isn't it separate?

It isn't separate because that isn't how Jeff and Bill designed it. I think the design they have is great.

Instead of trying to pick holes in the theory, can you demonstrate a real performance problem with compression=on and dedup=on and show that it is caused by the compression step? Otherwise, if you want it changed, code it up and show how what you have done is better in all cases.

--
Darren J Moffat
Kjetil Torgrim Homme
2009-Dec-17 15:14 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Darren J Moffat <darrenm at opensolaris.org> writes:
> Kjetil Torgrim Homme wrote:
>> If the hash used for dedup were completely separate from the hash used
>> for data protection, I don't see any downsides to computing the dedup
>> hash from uncompressed data. Why isn't it separate?
>
> It isn't separate because that isn't how Jeff and Bill designed it.

Thanks for confirming that, Darren.

> I think the design they have is great.

I don't disagree.

> Instead of trying to pick holes in the theory, can you demonstrate a
> real performance problem with compression=on and dedup=on and show
> that it is caused by the compression step?

Compression requires CPU, actually quite a lot of it. Even with the lean and mean lzjb, you will get not much more than roughly 150 MB/s per core. So if you're copying a 10 GB image file, it will take a minute or two just to compress the data so that the hash can be computed and the duplicate blocks identified. If the dedup hash were based on uncompressed data, the copy would be limited by hashing efficiency (and dedup tree lookup).

I don't know how tightly interwoven the dedup hash tree and the block pointer hash tree are, or whether it is at all possible to disentangle them. Conceptually it doesn't seem impossible, but that's easy for me to say, with no knowledge of the zio pipeline...

Oh, and how does encryption play into this? Just don't? Knowing that someone else has the same block as you is leaking information, but that may be acceptable - just make different pools for people you don't trust.

> Otherwise, if you want it changed, code it up and show how what you
> have done is better in all cases.

I wish I could :-)

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
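The rough arithmetic behind the "minute or two" figure, taking the 150 MB/s per-core number above at face value (it is an estimate, not a measurement of any particular machine):

    image_size_gb = 10
    lzjb_throughput_mb_s = 150          # assumed per-core lzjb rate

    seconds = image_size_gb * 1024 / lzjb_throughput_mb_s
    print(f"~{seconds:.0f} s of one core just to compress before the dedup lookup")
    # ~68 s - work that would be skipped entirely for duplicate blocks
    # if the dedup key were computed on the uncompressed record.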
Darren J Moffat
2009-Dec-17 16:14 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Kjetil Torgrim Homme wrote:
> I don't know how tightly interwoven the dedup hash tree and the block
> pointer hash tree are, or whether it is at all possible to disentangle
> them.

At the moment I'd say very interwoven, by design.

> Conceptually it doesn't seem impossible, but that's easy for me to
> say, with no knowledge of the zio pipeline...

Correct, it isn't impossible, but there would probably need to be two checksums held: one of the untransformed data (i.e. uncompressed and unencrypted) and one of the transformed data (compressed and encrypted). That has different trade-offs, and SHA256 can be expensive too; see:

http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

Note also that the compress/encrypt/checksum and the dedup are separate pipeline stages, so while dedup is happening for block N, block N+1 can be getting transformed - this is designed to take advantage of multiple scheduling units (threads, cpus, cores etc).

> Oh, and how does encryption play into this? Just don't? Knowing that
> someone else has the same block as you is leaking information, but that
> may be acceptable - just make different pools for people you don't
> trust.

Compress, encrypt, checksum, dedup.

You are correct that it is an information leak, but only within a dataset and its clones, and only if you can observe the deduplication stats (and you need to use zdb to get enough info to see the leak - which means you have access to the raw devices); the dedupratio isn't really enough unless the pool is really idle or has only one user writing at a time.

For the encryption case, deduplication of the same plaintext block will only work within a dataset or a clone of it, because only in those cases do you have the same key (and the way I have implemented the IV generation for AES CCM/GCM mode ensures that the same plaintext will have the same IV, so the ciphertexts will match).

Also, if you place a block in an unencrypted dataset that happens to match the ciphertext in an encrypted dataset, they won't dedup either (you need to understand what I've done with the AES CCM/GCM MAC and the zio_cksum_t field in the blkptr_t, and how that is used by dedup, to see why).

If that small information leak isn't acceptable even within the dataset, then don't enable both encryption and deduplication on those datasets - and don't delegate that property to your users either. Or you can frequently rekey your per-dataset data encryption keys with 'zfs key -K', but then you might as well turn dedup off - although there are some very good use cases in multi-level security where doing dedup/encryption and rekey provides a nice effect.

--
Darren J Moffat
Bob Friesenhahn
2009-Dec-17 16:18 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
On Thu, 17 Dec 2009, Kjetil Torgrim Homme wrote:
>
> Compression requires CPU, actually quite a lot of it. Even with the
> lean and mean lzjb, you will get not much more than roughly 150 MB/s
> per core. So if you're copying a 10 GB image file, it will take a
> minute or two just to compress the data so that the hash can be
> computed and the duplicate blocks identified. If the dedup hash were
> based on uncompressed data, the copy would be limited by hashing
> efficiency (and dedup tree lookup).

It is useful to keep in mind that deduplication can save a lot of disk space, but it is usually only effective in certain circumstances, such as when replicating a collection of files. The majority of write I/O will never benefit from deduplication. Based on this, speculatively assuming that the data will not be deduplicated does not increase cost most of the time. If the data does end up being deduplicated, then that is a blessing.

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Nicolas Williams
2009-Dec-17 17:13 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
On Thu, Dec 17, 2009 at 03:32:21PM +0100, Kjetil Torgrim Homme wrote:
> If the hash used for dedup were completely separate from the hash used
> for data protection, I don't see any downsides to computing the dedup
> hash from uncompressed data. Why isn't it separate?

Hash and checksum functions are slow (hash functions are slower, but either way you'll be loading large blocks of data, which sets a floor on the cost). Duplicating work is bad for performance. Using the same checksum for integrity protection and dedup is an optimization, and a very nice one at that. Having separate checksums would require making blkptr_t larger, which imposes its own costs.

There are lots of trade-offs here. Using the same checksum/hash for integrity protection and dedup is a great solution. If you use a non-cryptographic checksum algorithm, then you'll want to enable verification for dedup. That's all.

Nico
On Thu, Dec 17, 2009 at 6:14 PM, Kjetil Torgrim Homme <kjetilho at linpro.no> wrote:
> Compression requires CPU, actually quite a lot of it. Even with the
> lean and mean lzjb, you will get not much more than roughly 150 MB/s
> per core. So if you're copying a 10 GB image file, it will take a
> minute or two just to compress the data so that the hash can be
> computed and the duplicate blocks identified. If the dedup hash were
> based on uncompressed data, the copy would be limited by hashing
> efficiency (and dedup tree lookup).

This isn't exactly true. If, speculatively, one stores two hashes - one for the uncompressed data in the DDT, and another, for the compressed data, kept with the data block for data healing - one saves the compression of duplicates and pays with an extra hash computation for singletons. So the more precise question is whether the set of cases where the duplicate/singleton ratio and the compression/hashing bandwidth ratio make this a win is non-empty (or, rather, of practical importance).

Regards,
Andrey
Daniel Carosone
2009-Dec-17 23:53 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Your parenthetical comments here raise some concerns, or at least eyebrows, with me. Hopefully you can lower them again.

> compress, encrypt, checksum, dedup.

> (and you need to use zdb to get enough info to see the
> leak - and that means you have access to the raw devices)

An attacker with access to the raw devices is the primary base threat model for on-disk encryption, surely?

An attacker with access to disk traffic, via e.g. iSCSI, who can also deploy dynamic traffic analysis in addition to static content analysis, and who also has similarly greater opportunities for tampering, is another, trickier threat model.

It seems like entirely wrong thinking (even in parentheses) to dismiss an issue as irrelevant because it only applies in the primary threat model.

> (and the way I have implemented the IV
> generation for AES CCM/GCM mode ensures that the same
> plaintext will have the same IV so the ciphertexts will match).

Again, this seems like a cause for concern. Have you effectively turned these fancy and carefully designed crypto modes back into ECB, albeit at a larger block size (and only within a dataset)?

Let's consider copy-on-write semantics: with the above issue, an attacker can tell which blocks of a file have changed over time, even if unchanged blocks have been rewritten - giving even the static-image attacker some traffic analysis capability.

This would be a problem regardless of dedup, for the scenario where the attacker can see repeated ciphertext on disk (unless the dedup metadata itself is sufficiently encrypted, which I understand it is not).

> (you need to understand
> what I've done with the AES CCM/GCM MAC

I'd like to, but more to understand what (if any) protection is given against replay attacks, beyond that already provided by the Merkle hash tree.

I await ZFS crypto with even more enthusiasm than dedup; thanks for talking about the details with us.
Kjetil Torgrim Homme
2009-Dec-18 10:48 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Darren J Moffat <darrenm at opensolaris.org> writes:
> Kjetil Torgrim Homme wrote:
>
>> I don't know how tightly interwoven the dedup hash tree and the block
>> pointer hash tree are, or whether it is at all possible to disentangle
>> them.
>
> At the moment I'd say very interwoven, by design.
>
>> Conceptually it doesn't seem impossible, but that's easy for me to
>> say, with no knowledge of the zio pipeline...
>
> Correct, it isn't impossible, but there would probably need to be two
> checksums held: one of the untransformed data (ie uncompressed and
> unencrypted) and one of the transformed data (compressed and encrypted).
> That has different trade-offs, and SHA256 can be expensive too; see:
>
> http://blogs.sun.com/darren/entry/improving_zfs_dedup_performance_via

Great work! SHA256 is more expensive than I thought - even with misc/sha2 it takes 1 ms per 128 KiB? That's roughly the same CPU usage as lzjb! In that case, hashing the (smaller) compressed data is more efficient than doing an additional hash of the full uncompressed block.

It's interesting to note that 64 KiB looks faster (a bit hard to read the chart accurately); L1 cache size coming into play, perhaps?

> Note also that the compress/encrypt/checksum and the dedup are
> separate pipeline stages, so while dedup is happening for block N, block
> N+1 can be getting transformed - this is designed to take advantage
> of multiple scheduling units (threads, cpus, cores etc).

Nice. Are all of them separate stages, or are compress/encrypt/checksum done as one stage?

> Compress, encrypt, checksum, dedup.
>
> For the encryption case, deduplication of the same plaintext block will
> only work within a dataset or a clone of it, because only in those
> cases do you have the same key (and the way I have implemented the IV
> generation for AES CCM/GCM mode ensures that the same plaintext will
> have the same IV, so the ciphertexts will match).

Makes sense.

> Also, if you place a block in an unencrypted dataset that happens to
> match the ciphertext in an encrypted dataset, they won't dedup either
> (you need to understand what I've done with the AES CCM/GCM MAC and
> the zio_cksum_t field in the blkptr_t, and how that is used by dedup,
> to see why).

Wow, I didn't think of that problem. Did you get bitten by wrongful dedup during testing with image files? :-)

> If that small information leak isn't acceptable even within the
> dataset, then don't enable both encryption and deduplication on those
> datasets - and don't delegate that property to your users either. Or
> you can frequently rekey your per-dataset data encryption keys with
> 'zfs key -K', but then you might as well turn dedup off - although
> there are some very good use cases in multi-level security where doing
> dedup/encryption and rekey provides a nice effect.

Indeed. ZFS is extremely flexible. Thank you for your response, it was very enlightening.

--
Kjetil T. Homme
Redpill Linpro AS - Changing the game
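The "1 ms per 128 KiB" figure is easy to sanity-check with a rough userland timing like the one below; the absolute number will of course differ from the in-kernel misc/sha2 measurements in Darren's post.

    import hashlib
    import os
    import time

    block = os.urandom(128 * 1024)                 # one 128 KiB record
    iterations = 1000

    start = time.perf_counter()
    for _ in range(iterations):
        hashlib.sha256(block).digest()
    elapsed = time.perf_counter() - start

    print(f"{elapsed / iterations * 1e3:.3f} ms per 128 KiB block")
    print(f"{128 * iterations / 1024 / elapsed:.0f} MiB/s")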
Darren J Moffat
2009-Dec-21 22:44 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Daniel Carosone wrote:
> Your parenthetical comments here raise some concerns, or at least
> eyebrows, with me. Hopefully you can lower them again.
>
>> compress, encrypt, checksum, dedup.
>
>> (and you need to use zdb to get enough info to see the
>> leak - and that means you have access to the raw devices)
>
> An attacker with access to the raw devices is the primary base threat
> model for on-disk encryption, surely?
>
> An attacker with access to disk traffic, via e.g. iSCSI, who can also
> deploy dynamic traffic analysis in addition to static content analysis,
> and who also has similarly greater opportunities for tampering, is
> another, trickier threat model.
>
> It seems like entirely wrong thinking (even in parentheses) to dismiss
> an issue as irrelevant because it only applies in the primary threat
> model.

I wasn't dismissing it; I was pointing out that this isn't something an unprivileged end user could easily do. If the risk is unacceptable then dedup shouldn't be enabled. For some use cases the risk is acceptable, and for those use cases we want to allow the use of dedup with encryption.

>> (and the way I have implemented the IV
>> generation for AES CCM/GCM mode ensures that the same
>> plaintext will have the same IV so the ciphertexts will match).
>
> Again, this seems like a cause for concern. Have you effectively turned
> these fancy and carefully designed crypto modes back into ECB, albeit at
> a larger block size (and only within a dataset)?

No, I don't believe I have. The IV generation when doing deduplication is done by calculating an HMAC of the plaintext using a separate per-dataset key (which is also refreshed if 'zfs key -K' is run to rekey the dataset).

> Let's consider copy-on-write semantics: with the above issue, an
> attacker can tell which blocks of a file have changed over time, even if
> unchanged blocks have been rewritten - giving even the static-image
> attacker some traffic analysis capability.

If that is part of your deployment risk model, then deduplication is not worth enabling in that case.

> This would be a problem regardless of dedup, for the scenario where the
> attacker can see repeated ciphertext on disk (unless the dedup metadata
> itself is sufficiently encrypted, which I understand it is not).

In the case where deduplication is not enabled, the IV generation uses a combination of the txg number, the object and the blockid, which complies with the recommendations for IV generation for both CCM and GCM.

>> (you need to understand
>> what I've done with the AES CCM/GCM MAC
>
> I'd like to, but more to understand what (if any) protection is given
> against replay attacks, beyond that already provided by the Merkle hash
> tree.

What do you mean by a replay attack? What is being replayed, and by whom?

--
Darren J Moffat
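A conceptual sketch of the two IV strategies described above (this is not Darren's actual implementation; IV widths and on-disk details are glossed over): a deterministic IV derived from an HMAC of the plaintext with a separate per-dataset key when dedup is enabled, versus a unique per-write IV built from (txg, object, blockid) when it is not.

    import hashlib
    import hmac

    def iv_for_dedup(plaintext: bytes, iv_key: bytes) -> bytes:
        # Same plaintext + same per-dataset IV key -> same IV -> same
        # ciphertext, which is what allows encrypted blocks to dedup at all.
        return hmac.new(iv_key, plaintext, hashlib.sha256).digest()[:12]

    def iv_for_non_dedup(txg: int, obj: int, blkid: int) -> bytes:
        # Unique per logical write, so identical plaintexts get different
        # ciphertexts and block equality does not leak onto disk.
        return (txg.to_bytes(6, "big") + obj.to_bytes(3, "big")
                + blkid.to_bytes(3, "big"))

    iv_key = b"\x01" * 32            # stands in for the per-dataset key, refreshed on rekey
    block = b"same plaintext block"
    assert iv_for_dedup(block, iv_key) == iv_for_dedup(block, iv_key)
    assert iv_for_non_dedup(100, 7, 0) != iv_for_non_dedup(101, 7, 0)

The deterministic variant is what lets identical plaintexts produce identical ciphertexts (and therefore dedup) within one dataset; the unique-IV variant reveals nothing about which blocks are equal.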
Darren J Moffat
2009-Dec-21 22:47 UTC
[zfs-discuss] DeDup and Compression - Reverse Order?
Kjetil Torgrim Homme wrote:
>> Note also that the compress/encrypt/checksum and the dedup are
>> separate pipeline stages, so while dedup is happening for block N, block
>> N+1 can be getting transformed - this is designed to take advantage
>> of multiple scheduling units (threads, cpus, cores etc).
>
> Nice. Are all of them separate stages, or are compress/encrypt/checksum
> done as one stage?

Originally compress, encrypt and checksum were all separate stages in the zio pipeline; they are now all one stage, ZIO_WRITE_BP_INIT for the write case and ZIO_READ_BP_INIT for the read case.

>> Also, if you place a block in an unencrypted dataset that happens to
>> match the ciphertext in an encrypted dataset, they won't dedup either
>> (you need to understand what I've done with the AES CCM/GCM MAC and
>> the zio_cksum_t field in the blkptr_t, and how that is used by dedup,
>> to see why).
>
> Wow, I didn't think of that problem. Did you get bitten by wrongful
> dedup during testing with image files? :-)

No, I didn't see the problem in reality; I just thought about it as a possible risk that needed to be addressed. Solving it didn't actually require me to do any additional work, because ZFS uses a separate table for each checksum algorithm anyway, and the checksum algorithm for encrypted datasets is listed as sha256+mac, not sha256. It was nice that I didn't have to write more code to solve the problem, but it may not have been that way.

--
Darren J Moffat
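A sketch of why the accidental cross-dataset match cannot happen, assuming only what is described above (the real DDT layout differs): dedup entries live in separate tables per checksum algorithm, and encrypted datasets report their algorithm as sha256+mac rather than sha256, so their entries are never even consulted by a plaintext dataset.

    from collections import defaultdict

    # One dedup table per checksum algorithm; encrypted datasets use a
    # distinct algorithm name ("sha256+mac"), so their entries can never
    # be found by a plaintext dataset looking up plain "sha256".
    ddt = defaultdict(dict)

    def dedup_lookup(algorithm: str, checksum: bytes):
        return ddt[algorithm].get(checksum)

    def dedup_insert(algorithm: str, checksum: bytes, address: str):
        ddt[algorithm][checksum] = address

    same_bits = b"\xab" * 32                  # pretend the raw 256-bit values collide
    dedup_insert("sha256+mac", same_bits, "dva-encrypted")
    assert dedup_lookup("sha256", same_bits) is None   # no false dedup across tables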
On Mon, Dec 21, 2009 at 02:44:00PM -0800, Darren J Moffat wrote:
> The IV generation when doing deduplication is done by calculating an
> HMAC of the plaintext using a separate per-dataset key (which is also
> refreshed if 'zfs key -K' is run to rekey the dataset).
> [..]
> In the case where deduplication is not enabled, the IV generation uses a
> combination of the txg number, the object and the blockid, which
> complies with the recommendations for IV generation for both CCM and
> GCM.

Aha! This was the crucial detail - that IV generation depends on the dedup setting. It makes perfect sense now, and seems like a sensible choice to enable a meaningful risk vs space trade-off, rather than having mutually defeating features.

One (obvious, and no doubt well dealt with) question: I presume the IV method is stored per block, similar to compression, so that changing the dedup setting doesn't cause decryption to use the wrong IV?

> What do you mean by a replay attack?
> What is being replayed, and by whom?

Replay of previous disk blocks, substituted for more recent contents, by an attacker who either has offline access to the disk or MITM access to the storage path. Contrived example: playing back an old disk block containing a previous, compromised password, instead of the block with the new, changed password.

Many disk encryption mechanisms don't provide integrity at all, in part because (other than some recent advanced crypto modes) it costs extra space and therefore brings many other complexities. Even those with some integrity may not defend strongly against this kind of replay.

ZFS clearly is game-changing, particularly with respect to integrity protections. With the hopefully-imminent introduction of zfs-crypto, it's worthwhile understanding where the interaction and boundary between protection-against-error and protection-against-malice falls.

Including the txg counter in the IV (for non-dedup) is clearly useful, as is CoW (which changes the block number), but how far does it go back up the tree? Coming back down the tree, when do I first encounter zfs-crypto, and what can I fiddle with on the way to bypass or defeat its ongoing protections?

Put as a threat: I have access to your data pool (image). We'll ignore boot-time integrity issues for your rpool and the zfs-crypto executable code for now. We'll also assume, for simplicity, that I'm not trying to tamper with a live running pool. Perhaps I'm a malicious SAN admin, or the SAN admin hasn't locked down his infrastructure. Perhaps you left your usb widget at home and I used my ninja skills to break in. I want to tamper with your data, and for you to accept my changes without detection.

With plain ZFS, I have a whole lot more work to do than I would with normal filesystems, updating hash chains back up to the uberblock. It's tricky, but doesn't require any secrets. It's trivial if I don't need to rewrite history and can just commit a new txg.

With ZFS-crypto, the question is how far the integrity protection is extended with the inclusion of keys. What kinds of attack can still be mounted without keys, particularly against metadata tampering? It's the difference between a hash and an HMAC/signature, and what's included in each.

One example threat scenario: what if I change the dataset properties to "crypto=off" - will that cause future writes to be in plaintext? What would help a user notice?

None of this is a criticism; zfs-crypto is never going to be a universal solution to all threats. It's about understanding its coverage and limitations, and about deciding whether and which existing (much less convenient) defenses can be dropped in favour of this new hotness, as much as it is about which of those still provide value or which threats remain unprotected because no defense is available or economical.

(Subject: changed accordingly)

--
Dan.