Hey, I got another question for ZFS developers -

Given: If you enable dedup and write a bunch of data, and then disable
dedup, the formerly written data will remain dedup'd.

Given: The zdb -s command, which simulates dedup to provide dedup
statistics without actually enabling dedup.

Question: Is it possible, or can it easily become possible, to
periodically dedup a pool instead of keeping dedup running all the time?
It is easy to imagine situations where idle or maintenance windows would
be appropriate for deduping a pool, but where the performance and/or
resource requirements of keeping dedup running all the time are not
desirable.
On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Question: Is it possible, or can it easily become possible, to periodically
> dedup a pool instead of keeping dedup running all the time? It is easy to

I think it's been discussed before, and the conclusion is that it would
require bp_rewrite.

Offline (or deferred) dedup certainly seems more attractive given the
current real-time performance.

-B

--
Brandon High : bhigh at freaks.com
On Thu, May 26, 2011 at 09:04:04AM -0700, Brandon High wrote:
> On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
> <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> > Question: Is it possible, or can it easily become possible, to periodically
> > dedup a pool instead of keeping dedup running all the time? It is easy to
>
> I think it's been discussed before, and the conclusion is that it
> would require bp_rewrite.

Yes, and it would possibly require more of bp_rewrite than any other use
case (i.e., a more complex bp_rewrite).

> Offline (or deferred) dedup certainly seems more attractive given the
> current real-time performance.

I'm not so sure. Or, rather, if it were there and available now, I'm sure
some people would use it and prefer it for their circumstances. Nothing
comes for free, in terms of development or operational complexity.

It seems attractive for retroactively recovering space, as a rare
operation, while maintaining snapshot integrity (and not taking everything
offline for a send|recv). But you want to be sure you can carry the cost of
that space saving.

Once your data is dedup'ed, by whatever means, access to it is the same.
You need enough memory+l2arc to indirect references via the DDT. If this is
your performance problem today, it will not be helped much by deferral.
Reads will still have the same issue, as will the deferred dedup write
workload (with more work overall). But I don't think it solves the core
overhead of freeing deduped blocks, and once that's no longer a problem for
you, neither is the synchronous dedup. Plus, if you're just on the edge,
frees can be deferred as noted previously, though that's not a very nice
place to be.

I tend to think that background/deferred dedup is a task more similar to
HSM / archival type activities, one that will involve some level of
application responsibility as well as fs-level assistance hooks. For all
the work it would involve, I'd like to get more value than just a few saved
disk blocks.

--
Dan.
2011-05-26 19:37, Edward Ned Harvey wrote:
> Hey, I got another question for ZFS developers -
>
> Given: If you enable dedup and write a bunch of data, and then
> disable dedup, the formerly written data will remain dedup'd.
>
> Given: The zdb -s command, which simulates dedup to provide dedup
> statistics without actually enabling dedup.
>
> Question: Is it possible, or can it easily become possible, to
> periodically dedup a pool instead of keeping dedup running all the
> time? It is easy to imagine some situations where idle or maintenance
> windows might be appropriate to dedup a pool, but the performance
> and/or resource requirements of keeping dedup running all the time
> might not be desirable, in some situations.

One more rationale for this idea is that with deferred dedup in place, the
DDT may be forced to hold only non-unique blocks (2+ references), and would
require less storage in RAM, disk, L2ARC, etc. - in case we agree to remake
the DDT on every offline-dedup operation.

Also, if the system uses acceptable checksums (sha256), finding matches to
enable offline dedup should be relatively easy - just read in all of the
metadata (hashes) and sort it ;)

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular)  mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
() ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
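To make Jim's sort-the-hashes idea concrete, here is a minimal Python
sketch of how an offline pass might collect dedup candidates from a dump of
(checksum, block pointer) pairs. The record format and the
dedup_candidates() helper are illustrative assumptions, not anything zdb or
ZFS exposes today:

    import itertools

    def dedup_candidates(records):
        """Group block records by checksum; any group with two or more
        members is a set of blocks an offline pass could collapse."""
        # records: iterable of (checksum_bytes, block_pointer) tuples,
        # e.g. harvested during a scrub-like walk of the pool metadata.
        by_cksum = sorted(records, key=lambda r: r[0])
        for cksum, group in itertools.groupby(by_cksum, key=lambda r: r[0]):
            bps = [bp for _, bp in group]
            if len(bps) >= 2:
                # Keep bps[0]; the parents of the other copies must be
                # rewritten to point at it, which is the bp_rewrite part.
                yield cksum, bps

Finding the matches is the cheap part; rewriting the parents of the
redundant copies is where the bp_rewrite discussion above comes in.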
On Fri, May 27, 2011 at 04:32:03AM +0400, Jim Klimov wrote:
> One more rationale in this idea is that with deferred dedup
> in place, the DDT may be forced to hold only non-unique
> blocks (2+ references), and would require less storage in
> RAM, disk, L2ARC, etc. - in case we agree to remake the
> DDT on every offline-dedup operation.

This is an interesting point. In this case, deferred dedup would be the
only way to get a given block hash to have 2 or more duplicates, but once
in there, further copies could be added as normal. This probably gives you
most of the (space) benefit for much less (memory) cost.

In reverse, pruning the DDT of single-instance blocks could be a useful
operation, for recovery from a case where you made a DDT too large for the
system. It would still need a complex bp_rewrite.

--
Dan.
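A rough sketch of the pruning idea, with the DDT modelled as a plain
checksum-to-entry mapping (the real DDT lives in ZAP objects on disk, so
this is purely conceptual):

    def prune_unique_entries(ddt):
        """Keep only DDT entries with two or more references.

        The dropped single-copy blocks would also need their referencing
        block pointers rewritten so they are no longer marked as dedup'd,
        which is why this still needs bp_rewrite, as noted above."""
        return {cksum: entry for cksum, entry in ddt.items()
                if entry["refcount"] >= 2}

The table that has to fit in ARC/L2ARC then shrinks to just the entries
that are actually earning their keep.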
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Thursday, May 26, 2011 8:19 PM
>
> Once your data is dedup'ed, by whatever means, access to it is the
> same. You need enough memory+l2arc to indirect references via
> DDT.

I don't think this is true. The reason you need arc+l2arc to store your
DDT is because when you perform a write, the system will need to check and
see if that block is a duplicate of an already existing block. If you
dedup once, and later disable dedup, the system won't bother checking to
see if there are duplicate blocks anymore. So the DDT won't need to be in
arc+l2arc. I should say "shouldn't."
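For reference, the write path Ed describes looks roughly like this - a
conceptual Python sketch, not the actual zio pipeline; ddt, checksum() and
allocate() are stand-ins:

    def dedup_write(ddt, data, checksum, allocate):
        """Conceptual dedup write: one DDT lookup per block written.

        With dedup disabled this function is skipped entirely and the
        block is simply allocated, which is Ed's point about writes no
        longer needing the DDT in ARC/L2ARC."""
        cksum = checksum(data)
        entry = ddt.get(cksum)
        if entry is not None:
            entry["refcount"] += 1      # duplicate: just add a reference
            return entry["dva"]         # reuse the existing on-disk address
        dva = allocate(data)            # unique: actually write the block
        ddt[cksum] = {"refcount": 1, "dva": dva}
        return dva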
Dan> ... It would still need a complex bp_rewrite.

Are you certain about that? For example, scrubbing/resilvering and fixing
corrupt blocks with non-matching checksums is a post-processing operation
which works on an existing pool and rewrites some blocks if needed. And it
works without a bp_rewrite in place...

Basically, you'd need to ensure that a single TXG would include updates to
the DDT entry for found unique blocks, freeing of the extra blocks with the
same data (checksum), and creation of "ditto" copies if a specified
threshold is exceeded - where the dittos might point to one of the already
existing extra blocks instead of it being freed.

What's more: if offline dedup were modelled (or implemented) like
scrubbing, it could be stopped at any point in its progress and then
continued (or redone from the start - but with some blocks already
deduped), and it would have a cumulative effect between invocations. This
would be acceptable for users with "bursty" writes, i.e. storing documents
on a filer during their work day. That is, you could schedule offline dedup
to run, say, between 0am and 6am, and by the time workers come to the
office, some of their storage's disk space may have been recovered and the
system is fast and responsive. The next night it continues and maybe
recovers some more space...

Also, if offline dedup could be throttled like scrubs can be throttled now,
it could run continuously in the background. Perhaps with an ARC/L2ARC
cache large enough, it wouldn't even be the huge real-time performance
degrader that dedup is now.

I can stand by Ed's findings that enabled dedup slows down write speeds on
my system approximately 10x compared to writes into non-deduped datasets;
however, a lot of the time is spent by the CPU in kernel calls (close to
50% on a dual-core) and quite a lot in disk I/O. At the moment my test
system is down, so I can't quote specific numbers, but as I remember there
were about 2-3Mb/s of writes to each of my 6 disks in raidz2 while the
end-user throughput (according to rsync) was 1.8-2Mb/s overall. Writes to
datasets without dedup could sustain 20-40Mb/s at least.

HTH,
//Jim
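A toy sketch of the stop-and-resume behaviour Jim describes, with a
scrub-style checkpoint so each maintenance window picks up where the
previous one left off. The pool walk, checkpoint store and process_block()
are all hypothetical:

    import time

    def offline_dedup_pass(blocks, process_block, load_checkpoint,
                           save_checkpoint, deadline, throttle_s=0.0):
        """Walk blocks in a fixed order, dedup as we go, and checkpoint
        progress so the next window resumes instead of starting over."""
        start = load_checkpoint()           # index reached by the last run
        for i in range(start, len(blocks)):
            if time.time() >= deadline:     # e.g. 6am: stop, keep the gains
                break
            process_block(blocks[i])        # hash, match, free duplicates
            save_checkpoint(i + 1)
            if throttle_s:
                time.sleep(throttle_s)      # crude stand-in for scrub throttling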
2011/5/27 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> I don't think this is true. The reason you need arc+l2arc to store your DDT
> is because when you perform a write, the system will need to check and see
> if that block is a duplicate of an already existing block. If you dedup
> once, and later disable dedup, the system won't bother checking to see if
> there are duplicate blocks anymore. So the DDT won't need to be in
> arc+l2arc. I should say "shouldn't."

Except when deleting deduped blocks.

--
Frank Van Damme

No part of this copyright message may be reproduced, read or seen, dead or
alive or by any means, including but not limited to telepathy without the
benevolence of the author.
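Frank's point in a nutshell: freeing a dedup'd block is a read-modify-write
of its DDT entry, so deletes keep hitting the table even after dedup has
been switched off for new writes. A conceptual sketch, using the same
hypothetical ddt shape as above:

    def dedup_free(ddt, cksum, release):
        """Conceptual free of a dedup'd block: the DDT entry has to be
        looked up and updated before any space can actually be released."""
        entry = ddt[cksum]
        entry["refcount"] -= 1
        if entry["refcount"] == 0:
            release(entry["dva"])           # last reference: free the block
            del ddt[cksum]
        # With refcount still > 0, nothing is freed; only the entry changes.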
On Fri, May 27, 2011 at 07:28:06AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Thursday, May 26, 2011 8:19 PM
> >
> > Once your data is dedup'ed, by whatever means, access to it is the
> > same. You need enough memory+l2arc to indirect references via
> > DDT.
>
> I don't think this is true.
> The reason you need arc+l2arc to store your DDT
> is because when you perform a write, the system will need to check and see
> if that block is a duplicate of an already existing block. If you dedup
> once, and later disable dedup, the system won't bother checking to see if
> there are duplicate blocks anymore. So the DDT won't need to be in
> arc+l2arc. I should say "shouldn't."

Dedup'd blocks are found via the DDT, no matter how many references to them
exist. The DDT 'owns' the actual data block, and the regular referencing
files' metadata (bp) indicates that this block is dedup'd (indirect) rather
than regular (direct).

At least that's my somewhat-rusty recollection.

--
Dan.