Hi.

Currently zfs send has a -D flag, which allows deduplicating blocks
within a single stream.

I'm wondering if it would be possible not to send blocks in an
incremental stream if we know they are already part of the given
dataset and were sent to the remote site with some earlier snapshot.

I know deduplication is a pool-wide mechanism and a block might be part
of many different datasets. In my case I'd need to know that the block
I'm about to send is part of this particular dataset.

With the current ZFS design, is something like this even possible to
implement in some clean way, or would there be a need for heavy
modifications of ZFS internals?

If it is doable, could you suggest a good starting point?

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 6/9/2010 4:35 PM, Pawel Jakub Dawidek wrote:
> Hi.
>
> Currently zfs send has a -D flag, which allows deduplicating blocks
> within a single stream.
>
> I'm wondering if it would be possible not to send blocks in an
> incremental stream if we know they are already part of the given
> dataset and were sent to the remote site with some earlier snapshot.
>
> I know deduplication is a pool-wide mechanism and a block might be
> part of many different datasets. In my case I'd need to know that the
> block I'm about to send is part of this particular dataset.
>
> With the current ZFS design, is something like this even possible to
> implement in some clean way, or would there be a need for heavy
> modifications of ZFS internals?
>
> If it is doable, could you suggest a good starting point?

First off, even with an incremental, you could dedup at the receiving
end easily, so really, the only thing you would be doing is cutting
down on the amount of data being sent over the wire (which could be
significant).

You'd have to run some sort of process on the receiving system. There's
no other way to design this kind of thing - you can't rely on any
config/state/etc. on the "sending" system; for consistency, you'd
*have* to query the receiver for its state. The closest analog to what
you're asking for is rsync.

I don't see any modifications to ZFS that would have to be made to
support something like this - it's just a userland app.

Note that there would have to be a non-trivial amount of overhead data
communication between the two hosts. For each block being sent, the
sender would have to send the checksum over to the receiving side,
which would have to check its DDT to see if the block is already there.
It would then send back either an ACK or a NAK to tell the sender
whether or not to send the actual data. So, there'd be a *lot* of
small-packet traffic between the two machines. I suppose one could be
smart and package up multiple blocks' checksums in a single packet, but
the fact remains that such a system would be non-trivially chatty. Much
chattier than rsync.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
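A minimal sketch of the batched checksum negotiation Erik describes, in
C. Every name here, and the one-byte-per-block reply format, is made up
for illustration - this is not an existing ZFS or rsync interface:

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BATCH 256

    struct blk {
            uint8_t  cksum[32];     /* 256-bit block checksum */
            void    *data;
            size_t   len;
    };

    /*
     * Send one batch: ship the checksums up front, then send data
     * only for the blocks the receiver reports missing from its DDT.
     * Error handling and short reads/writes are omitted for brevity.
     */
    static void
    send_batch(int sock, struct blk *blks, int n)
    {
            uint8_t have[BATCH];    /* receiver's ACK/NAK bytes */

            /* 1. ship the n checksums */
            for (int i = 0; i < n; i++)
                    (void) write(sock, blks[i].cksum,
                        sizeof (blks[i].cksum));

            /* 2. receiver looks each one up in its DDT and answers
             *    one byte per block: 1 = already present, 0 = send */
            (void) read(sock, have, n);

            /* 3. send full data only for blocks the receiver lacks */
            for (int i = 0; i < n; i++)
                    if (!have[i])
                            (void) write(sock, blks[i].data,
                                blks[i].len);
    }

Batching this way trades latency for far fewer round trips, but the
protocol remains fundamentally chatty, as Erik notes.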
On 06/ 9/10 05:35 PM, Pawel Jakub Dawidek wrote:
> Hi.
>
> Currently zfs send has a -D flag, which allows deduplicating blocks
> within a single stream.
>
> I'm wondering if it would be possible not to send blocks in an
> incremental stream if we know they are already part of the given
> dataset and were sent to the remote site with some earlier snapshot.
>
> I know deduplication is a pool-wide mechanism and a block might be
> part of many different datasets. In my case I'd need to know that the
> block I'm about to send is part of this particular dataset.
>
> With the current ZFS design, is something like this even possible to
> implement in some clean way, or would there be a need for heavy
> modifications of ZFS internals?

It's not possible to implement unless we establish bidirectional
communication between the sending and receiving sides. The logic for
send-stream dedup is:

    for (each block to be written to stream) {
        get the block's checksum
        look up the block's checksum in the dedup table
            established for *this* stream generation
        if (an entry in the DDT exists for this checksum) {
            send a "write-by-reference" block across the stream
            (this contains a reference to a block sent earlier
            in the stream)
        } else {
            add an entry for this block to the DDT
            send the full block
        }
    }

Since the dedup table on the sending side only knows about blocks
already sent in the stream, we have no way of knowing whether a copy of
the block already exists on the other side, and even if we did know, we
wouldn't know where it was on the other side. The sending side would
have to have a copy of the other side's on-disk DDT to know whether a
write-by-reference could be used.

Lori
On Thu, Jun 10, 2010 at 12:21:02PM -0600, Lori Alt wrote:
> It's not possible to implement unless we establish bidirectional
> communication between the sending and receiving sides. The logic for
> send-stream dedup is:
>
>     for (each block to be written to stream) {
>         get the block's checksum
>         look up the block's checksum in the dedup table
>             established for *this* stream generation
>         if (an entry in the DDT exists for this checksum) {
>             send a "write-by-reference" block across the stream
>             (this contains a reference to a block sent earlier
>             in the stream)
>         } else {
>             add an entry for this block to the DDT
>             send the full block
>         }
>     }
>
> Since the dedup table on the sending side only knows about blocks
> already sent in the stream, we have no way of knowing whether a copy
> of the block already exists on the other side, and even if we did
> know, we wouldn't know where it was on the other side. The sending
> side would have to have a copy of the other side's on-disk DDT to
> know whether a write-by-reference could be used.

If we send an incremental stream we can be sure that, up to the
previous snapshot, we have the same data on the other side. I'm aware
that doesn't mean the data has exactly the same checksum (e.g. it can
be compressed with a different algorithm). But in theory, are we able
to figure out that the given block we are about to send is already part
of the dataset's previous snapshot? I'm fine with discarding the
incremental stream on the remote site if it uses a different
compression algorithm or if deduplication is simply turned off
(basically, when there is no block matching the stored checksum). But
if I have identical configurations on both ends, I'd like not to send
the same block multiple times in multiple incremental streams.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 06/10/10 02:21 PM, Pawel Jakub Dawidek wrote:
> On Thu, Jun 10, 2010 at 12:21:02PM -0600, Lori Alt wrote:
>> It's not possible to implement unless we establish bidirectional
>> communication between the sending and receiving sides. The logic for
>> send-stream dedup is:
>>
>>     for (each block to be written to stream) {
>>         get the block's checksum
>>         look up the block's checksum in the dedup table
>>             established for *this* stream generation
>>         if (an entry in the DDT exists for this checksum) {
>>             send a "write-by-reference" block across the stream
>>             (this contains a reference to a block sent earlier
>>             in the stream)
>>         } else {
>>             add an entry for this block to the DDT
>>             send the full block
>>         }
>>     }
>>
>> Since the dedup table on the sending side only knows about blocks
>> already sent in the stream, we have no way of knowing whether a copy
>> of the block already exists on the other side, and even if we did
>> know, we wouldn't know where it was on the other side. The sending
>> side would have to have a copy of the other side's on-disk DDT to
>> know whether a write-by-reference could be used.
>
> If we send an incremental stream we can be sure that, up to the
> previous snapshot, we have the same data on the other side. I'm aware
> that doesn't mean the data has exactly the same checksum (e.g. it can
> be compressed with a different algorithm). But in theory, are we able
> to figure out that the given block we are about to send is already
> part of the dataset's previous snapshot? I'm fine with discarding the
> incremental stream on the remote site if it uses a different
> compression algorithm or if deduplication is simply turned off
> (basically, when there is no block matching the stored checksum). But
> if I have identical configurations on both ends, I'd like not to send
> the same block multiple times in multiple incremental streams.

Each incremental stream contains only the blocks that are new or
changed since the last snapshot, so I don't see how you can be sure
that the data already exists on the receiving side. But even if you did
know that the block already exists on the receiving side, you don't
know where it is. That is, you don't know what to put in the
"reference" field of the send-stream record. You don't know the object
number and offset of where the block already exists on the receiving
side.

Lori
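For context on that "reference" field: the write-by-reference record in
a dedup'd send stream carries both where the data goes and where the
receiver should find the prior copy. The layout below is paraphrased
from memory from the onnv zfs_ioctl.h, so treat the exact field list as
approximate:

    /* Paraphrased from zfs_ioctl.h - fields may not be exact. */
    struct drr_write_byref {
            /* where to put the data on the receiving side */
            uint64_t drr_object;
            uint64_t drr_offset;
            uint64_t drr_length;
            uint64_t drr_toguid;
            /* where the receiver should find the prior copy */
            uint64_t drr_refguid;
            uint64_t drr_refobject;
            uint64_t drr_refoffset;
            /* checksum of the referenced block */
            uint8_t  drr_checksumtype;
            uint8_t  drr_checksumflags;
    };

The sender can only fill in drr_refobject/drr_refoffset for blocks it
placed earlier in the same stream, which is exactly Lori's point.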
On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
> If we send an incremental stream we can be sure that, up to the
> previous snapshot, we have the same data on the other side. I'm aware
> that doesn't mean the data has exactly the same checksum (e.g. it can
> be compressed with a different algorithm). But in theory, are we able
> to figure out that the given block we are about to send is already
> part of the dataset's previous snapshot? I'm fine with discarding the
> incremental stream on the remote site if it uses a different
> compression algorithm or if deduplication is simply turned off
> (basically, when there is no block matching the stored checksum). But
> if I have identical configurations on both ends, I'd like not to send
> the same block multiple times in multiple incremental streams.

No, you can't be sure. You can *assume* you sent the proper incremental
stream to the receiving host, but what if you didn't? Or what if it got
deleted? Etc. You *have* to check with the receiving host to see what's
there. As Lori pointed out, you need the DDT from the receiving host.

As I said earlier, this looks like it needs no ZFS code changes, just a
smart userland app. I'd use rsync's model, where you SSH over to the
other host, run the same binary (which knows it's in "receive" mode),
and set up the comm link between the two. The receiver's DDT gets
generated, passed back to the sender, and the sender can then do
lookups using both DDT sets; see the sketch below. It's really not that
complicated.

My sole worry is that since 'zfs send' and 'zfs receive' are moving
targets that track new ZFS filesystem version features, you'd have to
constantly update such an app to stay compatible with newer ZFS
versions.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
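A rough sketch of that sender-side lookup, assuming the receiver has
shipped a snapshot of its on-disk DDT back over the SSH channel. All
the types and helper functions here are hypothetical:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical types and helpers - not real ZFS interfaces. */
    typedef struct { uint64_t word[4]; } cksum_t;   /* 256-bit sum */
    typedef struct {
            cksum_t      cksum;
            const void  *data;
            size_t       len;
    } blk_t;

    extern bool stream_ddt_lookup(const cksum_t *); /* sent already? */
    extern void stream_ddt_insert(const cksum_t *);
    extern bool remote_ddt_lookup(const cksum_t *); /* receiver has? */
    extern void emit_write_byref(const blk_t *);    /* reference only */
    extern void emit_full_block(const blk_t *);     /* full data */

    /*
     * Sender-side loop: dedup first against blocks already sent in
     * this stream (what -D does today), then against the copy of the
     * receiver's on-disk DDT shipped back over the SSH channel.
     */
    void
    send_blocks(const blk_t *blks, size_t n)
    {
            for (size_t i = 0; i < n; i++) {
                    const blk_t *b = &blks[i];

                    if (stream_ddt_lookup(&b->cksum) ||
                        remote_ddt_lookup(&b->cksum)) {
                            /* data already on the wire earlier, or
                             * already present on the receiver */
                            emit_write_byref(b);
                    } else {
                            stream_ddt_insert(&b->cksum);
                            emit_full_block(b);
                    }
            }
    }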
On Thu, Jun 10, 2010 at 04:32:06PM -0700, Erik Trimble wrote:
> On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
>> If we send an incremental stream we can be sure that, up to the
>> previous snapshot, we have the same data on the other side. I'm
>> aware that doesn't mean the data has exactly the same checksum (e.g.
>> it can be compressed with a different algorithm). But in theory, are
>> we able to figure out that the given block we are about to send is
>> already part of the dataset's previous snapshot? I'm fine with
>> discarding the incremental stream on the remote site if it uses a
>> different compression algorithm or if deduplication is simply turned
>> off (basically, when there is no block matching the stored
>> checksum). But if I have identical configurations on both ends, I'd
>> like not to send the same block multiple times in multiple
>> incremental streams.
>
> No, you can't be sure. You can *assume* you sent the proper
> incremental stream to the receiving host, but what if you didn't? Or
> what if it got deleted? Etc.

So for this to work, the following conditions have to be met:

1. Pool configurations on both sides have to be identical - the same
   checksum algorithms, the same compression algorithms, etc.

2. No snapshots can be removed on the remote site, as we could lose a
   block that way.

3. We have to have all datasets on the remote site, as it would be too
   expensive to find out whether a given block that exists in the DDT
   is referenced by the given dataset. If I want to send a block and it
   exists in the DDT with refcount > 1, I have no way to tell which
   datasets are referencing it, short of scanning all datasets (or at
   least my dataset).

If those conditions are met, I can safely send just the checksums of
blocks with a birth date from before the snapshot I'm sending. Am I
right?

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
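A sketch of the per-block test that scheme implies, under conditions
1-3 above. The names are illustrative, not actual ZFS interfaces,
though block pointers really do carry a birth txg:

    #include <stdint.h>

    typedef enum { SEND_FULL_BLOCK, SEND_CHECKSUM_ONLY } send_kind_t;

    /* Decide per block: full data, or just a checksum reference. */
    static send_kind_t
    classify_block(uint64_t blk_birth_txg, uint64_t fromsnap_txg)
    {
            /*
             * Born at or before the base snapshot's txg: the receiver
             * got this block with an earlier stream and, per
             * condition 2, has not been allowed to free it.
             */
            if (blk_birth_txg <= fromsnap_txg)
                    return (SEND_CHECKSUM_ONLY);

            return (SEND_FULL_BLOCK);
    }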
On 6/15/2010 2:06 AM, Pawel Jakub Dawidek wrote:
> On Thu, Jun 10, 2010 at 04:32:06PM -0700, Erik Trimble wrote:
>> On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
>>> If we send an incremental stream we can be sure that, up to the
>>> previous snapshot, we have the same data on the other side. I'm
>>> aware that doesn't mean the data has exactly the same checksum
>>> (e.g. it can be compressed with a different algorithm). But in
>>> theory, are we able to figure out that the given block we are about
>>> to send is already part of the dataset's previous snapshot? I'm
>>> fine with discarding the incremental stream on the remote site if
>>> it uses a different compression algorithm or if deduplication is
>>> simply turned off (basically, when there is no block matching the
>>> stored checksum). But if I have identical configurations on both
>>> ends, I'd like not to send the same block multiple times in
>>> multiple incremental streams.
>>
>> No, you can't be sure. You can *assume* you sent the proper
>> incremental stream to the receiving host, but what if you didn't? Or
>> what if it got deleted? Etc.
>
> So for this to work, the following conditions have to be met:
>
> 1. Pool configurations on both sides have to be identical - the same
>    checksum algorithms, the same compression algorithms, etc.
>
> 2. No snapshots can be removed on the remote site, as we could lose a
>    block that way.
>
> 3. We have to have all datasets on the remote site, as it would be
>    too expensive to find out whether a given block that exists in the
>    DDT is referenced by the given dataset. If I want to send a block
>    and it exists in the DDT with refcount > 1, I have no way to tell
>    which datasets are referencing it, short of scanning all datasets
>    (or at least my dataset).
>
> If those conditions are met, I can safely send just the checksums of
> blocks with a birth date from before the snapshot I'm sending. Am I
> right?

Well... I suppose so. In theory, I see nothing wrong with what you are
saying. But that's a *whole* lot of very iffy preconditions, and it's
really not at all practical. In fact, I'd go so far as to say it's
*highly* unlikely you can meet them in most real-world cases.

Realistically, you've got four scenarios for sending an incremental
from sender A to receiver B:

1. A's pool has dedup on, B's is also on.
2. A's pool does NOT have dedup on, B's pool does.
3. A's pool does NOT have dedup on, and neither does B's.
4. A's pool has dedup on, B's pool doesn't have it on.

I'm assuming that your goal is to minimize the amount of data being
sent across the wire from host A to B.

Cases 3 & 4 mean that you can't do any better than 'zfs send -D | zfs
receive', as B has nothing to dedup against. You can dedup the sent
stream (which B will then expand when receiving it), but that's it.

Cases 1 & 2 will both give you maximum benefit, as B already has a DDT
for the receiving pool, and you can compare the to-be-sent stream
against that receiving DDT and dedup accordingly. Case 1 will be
faster, since A already has a pool DDT covering the to-be-sent stream,
while in case 2 a DDT would have to be computed solely for that stream.

You simply *must* talk to the receiving machine and pass back a DDT if
you want any practical chance of producing this kind of dedup'd stream.

Note that if the checksum type used on host A differs from the one on
host B, you can't do any form of extra dedup this way. I'd have to
check whether different compression types would cause problems, as I
can't recall whether compression affects the checksum being stored (I
think it does, as I'm pretty sure ZFS stores the checksum of the
post-compressed block, but I'm not 100% sure). All of these problems
would be easily detectable, and a properly written application would be
able to report such conditions back to the user (and should then fall
back to the standard 'zfs send -D' behavior).

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On 15/06/2010 10:48, Erik Trimble wrote:
> Note that if the checksum type used on host A differs from the one on
> host B, you can't do any form of extra dedup this way. I'd have to
> check whether different compression types would cause problems, as I
> can't recall whether compression affects the checksum being stored (I
> think it does, as I'm pretty sure ZFS stores the checksum of the
> post-compressed block, but I'm not 100% sure). All of these problems
> would be easily detectable, and a properly written application would
> be able to report such conditions back to the user (and should then
> fall back to the standard 'zfs send -D' behavior).

Both the checksum and the compression are used in the key for the DDT.

On disk the checksum is of the compressed block; in the send stream it
is of the uncompressed block (since zfs send is above the DMU layer).

-- 
Darren J Moffat
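For reference, the DDT key bundles the two together. Paraphrased from
memory from sys/ddt.h, so check the source for the exact definition:

    /*
     * Paraphrased from sys/ddt.h: the DDT key is the checksum plus an
     * encoded property word, so blocks only dedup against one another
     * when checksum, sizes, and compression all match.
     * (zio_cksum_t is four uint64_t words, i.e. a 256-bit checksum.)
     */
    typedef struct ddt_key {
            zio_cksum_t     ddk_cksum;  /* 256-bit block checksum */
            uint64_t        ddk_prop;   /* encodes LSIZE, PSIZE, and
                                           the compression function */
    } ddt_key_t;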
On 6/15/2010 2:59 AM, Darren J Moffat wrote:
> On 15/06/2010 10:48, Erik Trimble wrote:
>> Note that if the checksum type used on host A differs from the one
>> on host B, you can't do any form of extra dedup this way. I'd have
>> to check whether different compression types would cause problems,
>> as I can't recall whether compression affects the checksum being
>> stored (I think it does, as I'm pretty sure ZFS stores the checksum
>> of the post-compressed block, but I'm not 100% sure). All of these
>> problems would be easily detectable, and a properly written
>> application would be able to report such conditions back to the
>> user (and should then fall back to the standard 'zfs send -D'
>> behavior).
>
> Both the checksum and the compression are used in the key for the
> DDT.
>
> On disk the checksum is of the compressed block; in the send stream
> it is of the uncompressed block (since zfs send is above the DMU
> layer).

So, that would be even better. It would allow for "super" dedup even if
the compression algorithms were different.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA