Hi.

Currently zfs send has a -D flag, which allows deduplicating blocks
within a single stream.

I'm wondering if it would be possible not to send blocks in an
incremental stream if we know they are already part of the given
dataset and were sent to the remote site with some earlier snapshot.

I know deduplication is a pool-wide mechanism and a block might be part
of many different datasets. In my case I'd need to know that the block
I'm about to send is part of this particular dataset.

With the current ZFS design, is something like this even possible to
implement in some clean way, or would there be a need for heavy
modifications of ZFS internals?

If it is doable, could you suggest a good starting point?

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 6/9/2010 4:35 PM, Pawel Jakub Dawidek wrote:
> Hi.
>
> Currently zfs send has a -D flag, which allows deduplicating blocks
> within a single stream.
>
> I'm wondering if it would be possible not to send blocks in an
> incremental stream if we know they are already part of the given
> dataset and were sent to the remote site with some earlier snapshot.
>
> I know deduplication is a pool-wide mechanism and a block might be
> part of many different datasets. In my case I'd need to know that the
> block I'm about to send is part of this particular dataset.
>
> With the current ZFS design, is something like this even possible to
> implement in some clean way, or would there be a need for heavy
> modifications of ZFS internals?
>
> If it is doable, could you suggest a good starting point?

First off, even with an incremental, you could dedup at the receiving
end easily, so really, the only thing you would be doing is cutting
down on the amount of data being sent over the wire (which could be
significant).

You'd have to run some sort of process on the receiving system. There's
no other way to design this kind of thing - you can't rely on any
config/state/etc. on the "sending" system; for consistency, you'd
*have* to query the receiver for its state. The closest analog to what
you're asking for is rsync.

I don't see any modifications to ZFS that would have to be made to
support something like this - it's just a userland app.

Note that there would have to be a non-trivial amount of overhead data
communication between the two hosts. For each block being sent, the
sender would have to send the checksum over to the receiving side,
which would have to check its DDT to see if the block is already there.
It would then send back either an ACK or a NAK to tell the sender
whether or not to send the actual data. So, there'd be a *lot* of
small-packet traffic between the two machines. I suppose one could be
smart and package up multiple blocks' checksums in a single packet, but
the fact remains that such a system would be non-trivially chatty. Much
chattier than rsync.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
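A minimal sketch of the batched checksum negotiation Erik describes, in
C. Every name here, and the one-byte-per-block reply format, is made up
for illustration - this is not an existing ZFS or rsync interface:

    #include <stddef.h>
    #include <stdint.h>
    #include <unistd.h>

    #define BATCH 256

    struct blk {
            uint8_t  cksum[32];     /* 256-bit block checksum */
            void    *data;
            size_t   len;
    };

    /*
     * Send one batch: ship the checksums up front, then send data
     * only for the blocks the receiver reports missing from its DDT.
     * Error handling and short reads/writes are omitted for brevity.
     */
    static void
    send_batch(int sock, struct blk *blks, int n)
    {
            uint8_t have[BATCH];    /* receiver's ACK/NAK bytes */

            /* 1. ship the n checksums */
            for (int i = 0; i < n; i++)
                    (void) write(sock, blks[i].cksum,
                        sizeof (blks[i].cksum));

            /* 2. receiver looks each one up in its DDT and answers
             *    one byte per block: 1 = already present, 0 = send */
            (void) read(sock, have, n);

            /* 3. send full data only for blocks the receiver lacks */
            for (int i = 0; i < n; i++)
                    if (!have[i])
                            (void) write(sock, blks[i].data,
                                blks[i].len);
    }

Batching this way trades latency for far fewer round trips, but the
protocol remains fundamentally chatty, as Erik notes.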
On 06/ 9/10 05:35 PM, Pawel Jakub Dawidek wrote:
> Hi.
>
> Currently zfs send has a -D flag, which allows deduplicating blocks
> within a single stream.
>
> I'm wondering if it would be possible not to send blocks in an
> incremental stream if we know they are already part of the given
> dataset and were sent to the remote site with some earlier snapshot.
>
> I know deduplication is a pool-wide mechanism and a block might be
> part of many different datasets. In my case I'd need to know that the
> block I'm about to send is part of this particular dataset.
>
> With the current ZFS design, is something like this even possible to
> implement in some clean way, or would there be a need for heavy
> modifications of ZFS internals?

It's not possible to implement unless we establish bidirectional
communication between the sending and receiving sides. The logic for
send-stream dedup is:

    for (each block to be written to stream) {
        get the block's checksum
        look up the block's checksum in the dedup table
            established for *this* stream generation
        if (an entry in the DDT exists for this checksum) {
            send a "write-by-reference" block across the stream
            (this contains a reference to a block sent earlier
            in the stream)
        } else {
            add an entry for this block to the DDT
            send the full block
        }
    }

Since the dedup table on the sending side only knows about blocks
already sent in the stream, we have no way of knowing whether a copy of
the block already exists on the other side, and even if we did know, we
wouldn't know where it was on the other side. The sending side would
have to have a copy of the other side's on-disk DDT to know whether a
write-by-reference could be used.

Lori
On Thu, Jun 10, 2010 at 12:21:02PM -0600, Lori Alt wrote:
> It's not possible to implement unless we establish bidirectional
> communication between the sending and receiving sides. The logic for
> send-stream dedup is:
>
>     for (each block to be written to stream) {
>         get the block's checksum
>         look up the block's checksum in the dedup table
>             established for *this* stream generation
>         if (an entry in the DDT exists for this checksum) {
>             send a "write-by-reference" block across the stream
>             (this contains a reference to a block sent earlier
>             in the stream)
>         } else {
>             add an entry for this block to the DDT
>             send the full block
>         }
>     }
>
> Since the dedup table on the sending side only knows about blocks
> already sent in the stream, we have no way of knowing whether a copy
> of the block already exists on the other side, and even if we did
> know, we wouldn't know where it was on the other side. The sending
> side would have to have a copy of the other side's on-disk DDT to
> know whether a write-by-reference could be used.

If we send an incremental stream we can be sure that, up to the
previous snapshot, we have the same data on the other side. I'm aware
that doesn't mean the data has exactly the same checksum (e.g. it can
be compressed with a different algorithm). But in theory, are we able
to figure out that the given block we are about to send is already part
of the dataset's previous snapshot? I'm fine with discarding the
incremental stream on the remote site if it uses a different
compression algorithm or if deduplication is simply turned off
(basically, when there is no block matching the stored checksum). But
if I have identical configurations on both ends, I'd like not to send
the same block multiple times in multiple incremental streams.

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
On 06/10/10 02:21 PM, Pawel Jakub Dawidek wrote:
> On Thu, Jun 10, 2010 at 12:21:02PM -0600, Lori Alt wrote:
>> It's not possible to implement unless we establish bidirectional
>> communication between the sending and receiving sides. The logic for
>> send-stream dedup is:
>>
>>     for (each block to be written to stream) {
>>         get the block's checksum
>>         look up the block's checksum in the dedup table
>>             established for *this* stream generation
>>         if (an entry in the DDT exists for this checksum) {
>>             send a "write-by-reference" block across the stream
>>             (this contains a reference to a block sent earlier
>>             in the stream)
>>         } else {
>>             add an entry for this block to the DDT
>>             send the full block
>>         }
>>     }
>>
>> Since the dedup table on the sending side only knows about blocks
>> already sent in the stream, we have no way of knowing whether a copy
>> of the block already exists on the other side, and even if we did
>> know, we wouldn't know where it was on the other side. The sending
>> side would have to have a copy of the other side's on-disk DDT to
>> know whether a write-by-reference could be used.
>
> If we send an incremental stream we can be sure that, up to the
> previous snapshot, we have the same data on the other side. I'm aware
> that doesn't mean the data has exactly the same checksum (e.g. it can
> be compressed with a different algorithm). But in theory, are we able
> to figure out that the given block we are about to send is already
> part of the dataset's previous snapshot? I'm fine with discarding the
> incremental stream on the remote site if it uses a different
> compression algorithm or if deduplication is simply turned off
> (basically, when there is no block matching the stored checksum). But
> if I have identical configurations on both ends, I'd like not to send
> the same block multiple times in multiple incremental streams.

Each incremental stream contains only the blocks that are new or
changed since the last snapshot, so I don't see how you can be sure
that the data already exists on the receiving side. But even if you did
know that the block already exists on the receiving side, you don't
know where it is. That is, you don't know what to put in the
"reference" field of the send-stream record. You don't know the object
number and offset of where the block already exists on the receiving
side.

Lori
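For context on that "reference" field: the write-by-reference record in
a dedup'd send stream carries both where the data goes and where the
receiver should find the prior copy. The layout below is paraphrased
from memory from the onnv zfs_ioctl.h, so treat the exact field list as
approximate:

    /* Paraphrased from zfs_ioctl.h - fields may not be exact. */
    struct drr_write_byref {
            /* where to put the data on the receiving side */
            uint64_t drr_object;
            uint64_t drr_offset;
            uint64_t drr_length;
            uint64_t drr_toguid;
            /* where the receiver should find the prior copy */
            uint64_t drr_refguid;
            uint64_t drr_refobject;
            uint64_t drr_refoffset;
            /* checksum of the referenced block */
            uint8_t  drr_checksumtype;
            uint8_t  drr_checksumflags;
    };

The sender can only fill in drr_refobject/drr_refoffset for blocks it
placed earlier in the same stream, which is exactly Lori's point.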
On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
> If we send an incremental stream we can be sure that, up to the
> previous snapshot, we have the same data on the other side. I'm aware
> that doesn't mean the data has exactly the same checksum (e.g. it can
> be compressed with a different algorithm). But in theory, are we able
> to figure out that the given block we are about to send is already
> part of the dataset's previous snapshot? I'm fine with discarding the
> incremental stream on the remote site if it uses a different
> compression algorithm or if deduplication is simply turned off
> (basically, when there is no block matching the stored checksum). But
> if I have identical configurations on both ends, I'd like not to send
> the same block multiple times in multiple incremental streams.

No, you can't be sure. You can *assume* you sent the proper incremental
stream to the receiving host, but what if you didn't? Or what if it got
deleted? Etc. You *have* to check with the receiving host to see what's
there. As Lori pointed out, you need the DDT from the receiving host.

As I said earlier, this looks like it needs no ZFS code changes, just a
smart userland app. I'd use rsync's model, where you SSH over to the
other host, run the same binary (which knows it's in "receive" mode),
and set up the comm link between the two. The receiver's DDT gets
generated, passed back to the sender, and the sender can then do
lookups using both DDT sets; see the sketch below. It's really not that
complicated.

My sole worry is that since 'zfs send' and 'zfs receive' are moving
targets that track new ZFS filesystem version features, you'd have to
constantly update such an app to stay compatible with newer ZFS
versions.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
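A rough sketch of that sender-side lookup, assuming the receiver has
shipped a snapshot of its on-disk DDT back over the SSH channel. All
the types and helper functions here are hypothetical:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical types and helpers - not real ZFS interfaces. */
    typedef struct { uint64_t word[4]; } cksum_t;   /* 256-bit sum */
    typedef struct {
            cksum_t      cksum;
            const void  *data;
            size_t       len;
    } blk_t;

    extern bool stream_ddt_lookup(const cksum_t *); /* sent already? */
    extern void stream_ddt_insert(const cksum_t *);
    extern bool remote_ddt_lookup(const cksum_t *); /* receiver has? */
    extern void emit_write_byref(const blk_t *);    /* reference only */
    extern void emit_full_block(const blk_t *);     /* full data */

    /*
     * Sender-side loop: dedup first against blocks already sent in
     * this stream (what -D does today), then against the copy of the
     * receiver's on-disk DDT shipped back over the SSH channel.
     */
    void
    send_blocks(const blk_t *blks, size_t n)
    {
            for (size_t i = 0; i < n; i++) {
                    const blk_t *b = &blks[i];

                    if (stream_ddt_lookup(&b->cksum) ||
                        remote_ddt_lookup(&b->cksum)) {
                            /* data already on the wire earlier, or
                             * already present on the receiver */
                            emit_write_byref(b);
                    } else {
                            stream_ddt_insert(&b->cksum);
                            emit_full_block(b);
                    }
            }
    }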
On Thu, Jun 10, 2010 at 04:32:06PM -0700, Erik Trimble wrote:
> On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
>> If we send an incremental stream we can be sure that, up to the
>> previous snapshot, we have the same data on the other side. I'm
>> aware that doesn't mean the data has exactly the same checksum (e.g.
>> it can be compressed with a different algorithm). But in theory, are
>> we able to figure out that the given block we are about to send is
>> already part of the dataset's previous snapshot? I'm fine with
>> discarding the incremental stream on the remote site if it uses a
>> different compression algorithm or if deduplication is simply turned
>> off (basically, when there is no block matching the stored
>> checksum). But if I have identical configurations on both ends, I'd
>> like not to send the same block multiple times in multiple
>> incremental streams.
>
> No, you can't be sure. You can *assume* you sent the proper
> incremental stream to the receiving host, but what if you didn't? Or
> what if it got deleted? Etc.

So for this to work, the following conditions have to be met:

1. Pool configurations on both sides have to be identical - the same
   checksum algorithms, the same compression algorithms, etc.

2. No snapshots can be removed on the remote site, as we could lose a
   block that way.

3. We have to have all datasets on the remote site, as it would be too
   expensive to find out whether a given block that exists in the DDT
   is referenced by the given dataset. If I want to send a block and it
   exists in the DDT with refcount > 1, I have no way to tell which
   datasets are referencing it, short of scanning all datasets (or at
   least my dataset).

If those conditions are met, I can safely send just the checksums of
blocks with a birth date from before the snapshot I'm sending. Am I
right?

-- 
Pawel Jakub Dawidek                       http://www.wheelsystems.com
pjd at FreeBSD.org                          http://www.FreeBSD.org
FreeBSD committer                         Am I Evil? Yes, I Am!
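A sketch of the per-block test that scheme implies, under conditions
1-3 above. The names are illustrative, not actual ZFS interfaces,
though block pointers really do carry a birth txg:

    #include <stdint.h>

    typedef enum { SEND_FULL_BLOCK, SEND_CHECKSUM_ONLY } send_kind_t;

    /* Decide per block: full data, or just a checksum reference. */
    static send_kind_t
    classify_block(uint64_t blk_birth_txg, uint64_t fromsnap_txg)
    {
            /*
             * Born at or before the base snapshot's txg: the receiver
             * got this block with an earlier stream and, per
             * condition 2, has not been allowed to free it.
             */
            if (blk_birth_txg <= fromsnap_txg)
                    return (SEND_CHECKSUM_ONLY);

            return (SEND_FULL_BLOCK);
    }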
On 6/15/2010 2:06 AM, Pawel Jakub Dawidek wrote:
> On Thu, Jun 10, 2010 at 04:32:06PM -0700, Erik Trimble wrote:
>> On 6/10/2010 1:21 PM, Pawel Jakub Dawidek wrote:
>>> If we send an incremental stream we can be sure that, up to the
>>> previous snapshot, we have the same data on the other side. I'm
>>> aware that doesn't mean the data has exactly the same checksum
>>> (e.g. it can be compressed with a different algorithm). But in
>>> theory, are we able to figure out that the given block we are about
>>> to send is already part of the dataset's previous snapshot? I'm
>>> fine with discarding the incremental stream on the remote site if
>>> it uses a different compression algorithm or if deduplication is
>>> simply turned off (basically, when there is no block matching the
>>> stored checksum). But if I have identical configurations on both
>>> ends, I'd like not to send the same block multiple times in
>>> multiple incremental streams.
>>
>> No, you can't be sure. You can *assume* you sent the proper
>> incremental stream to the receiving host, but what if you didn't? Or
>> what if it got deleted? Etc.
>
> So for this to work, the following conditions have to be met:
>
> 1. Pool configurations on both sides have to be identical - the same
>    checksum algorithms, the same compression algorithms, etc.
>
> 2. No snapshots can be removed on the remote site, as we could lose a
>    block that way.
>
> 3. We have to have all datasets on the remote site, as it would be
>    too expensive to find out whether a given block that exists in the
>    DDT is referenced by the given dataset. If I want to send a block
>    and it exists in the DDT with refcount > 1, I have no way to tell
>    which datasets are referencing it, short of scanning all datasets
>    (or at least my dataset).
>
> If those conditions are met, I can safely send just the checksums of
> blocks with a birth date from before the snapshot I'm sending. Am I
> right?

Well... I suppose so. In theory, I see nothing wrong with what you are
saying. But that's a *whole* lot of very iffy preconditions, and it's
really not at all practical. In fact, I'd go so far as to say it's
*highly* unlikely you can meet them in most real-world cases.

Realistically, you've got four scenarios for sending an incremental
from sender A to receiver B:

1. A's pool has dedup on, B's is also on.
2. A's pool does NOT have dedup on, B's pool does.
3. A's pool does NOT have dedup on, and neither does B's.
4. A's pool has dedup on, B's pool doesn't have it on.

I'm assuming that your goal is to minimize the amount of data being
sent across the wire from host A to B.

Cases 3 & 4 mean that you can't do any better than 'zfs send -D | zfs
receive', as B has nothing to dedup against. You can dedup the sent
stream (which B will then expand when receiving it), but that's it.

Cases 1 & 2 will both give you maximum benefit, as B already has a DDT
for the receiving pool, and you can compare the to-be-sent stream
against that receiving DDT and dedup accordingly. Case 1 will be
faster, since A already has a pool DDT covering the to-be-sent stream,
while in case 2 a DDT would have to be computed solely for that stream.

You simply *must* talk to the receiving machine and pass back a DDT if
you want any practical chance of producing this kind of dedup'd stream.

Note that if the checksum type used on host A differs from the one on
host B, you can't do any form of extra dedup this way. I'd have to
check whether different compression types would cause problems, as I
can't recall whether compression affects the checksum being stored (I
think it does, as I'm pretty sure ZFS stores the checksum of the
post-compressed block, but I'm not 100% sure). All of these problems
would be easily detectable, and a properly written application would be
able to report such conditions back to the user (and should then fall
back to the standard 'zfs send -D' behavior).

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On 15/06/2010 10:48, Erik Trimble wrote:
> Note that if the checksum type used on host A differs from the one on
> host B, you can't do any form of extra dedup this way. I'd have to
> check whether different compression types would cause problems, as I
> can't recall whether compression affects the checksum being stored (I
> think it does, as I'm pretty sure ZFS stores the checksum of the
> post-compressed block, but I'm not 100% sure). All of these problems
> would be easily detectable, and a properly written application would
> be able to report such conditions back to the user (and should then
> fall back to the standard 'zfs send -D' behavior).

Both the checksum and the compression are used in the key for the DDT.

On disk the checksum is of the compressed block; in the send stream it
is of the uncompressed block (since zfs send is above the DMU layer).

-- 
Darren J Moffat
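For reference, the DDT key bundles the two together. Paraphrased from
memory from sys/ddt.h, so check the source for the exact definition:

    /*
     * Paraphrased from sys/ddt.h: the DDT key is the checksum plus an
     * encoded property word, so blocks only dedup against one another
     * when checksum, sizes, and compression all match.
     * (zio_cksum_t is four uint64_t words, i.e. a 256-bit checksum.)
     */
    typedef struct ddt_key {
            zio_cksum_t     ddk_cksum;  /* 256-bit block checksum */
            uint64_t        ddk_prop;   /* encodes LSIZE, PSIZE, and
                                           the compression function */
    } ddt_key_t;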
On 6/15/2010 2:59 AM, Darren J Moffat wrote:
> On 15/06/2010 10:48, Erik Trimble wrote:
>> Note that if the checksum type used on host A differs from the one
>> on host B, you can't do any form of extra dedup this way. I'd have
>> to check whether different compression types would cause problems,
>> as I can't recall whether compression affects the checksum being
>> stored (I think it does, as I'm pretty sure ZFS stores the checksum
>> of the post-compressed block, but I'm not 100% sure). All of these
>> problems would be easily detectable, and a properly written
>> application would be able to report such conditions back to the
>> user (and should then fall back to the standard 'zfs send -D'
>> behavior).
>
> Both the checksum and the compression are used in the key for the
> DDT.
>
> On disk the checksum is of the compressed block; in the send stream
> it is of the uncompressed block (since zfs send is above the DMU
> layer).

So, that would be even better. It would allow for "super" dedup even if
the compression algorithms were different.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA