Hey, I got another question for ZFS developers -

Given: If you enable dedup and write a bunch of data, and then disable
dedup, the formerly written data will remain dedup'd.

Given: The zdb -s command, which simulates dedup to provide dedup
statistics without actually enabling dedup.

Question: Is it possible, or can it easily become possible, to
periodically dedup a pool instead of keeping dedup running all the time?
It is easy to imagine situations where idle or maintenance windows would
be appropriate for deduping a pool, but where the performance and/or
resource requirements of keeping dedup running all the time are not
desirable.
On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
<opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> Question: Is it possible, or can it easily become possible, to periodically
> dedup a pool instead of keeping dedup running all the time? It is easy to

I think it's been discussed before, and the conclusion is that it would
require bp_rewrite.

Offline (or deferred) dedup certainly seems more attractive given the
current real-time performance.

-B

--
Brandon High : bhigh at freaks.com
On Thu, May 26, 2011 at 09:04:04AM -0700, Brandon High wrote:
> On Thu, May 26, 2011 at 8:37 AM, Edward Ned Harvey
> <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> > Question: Is it possible, or can it easily become possible, to periodically
> > dedup a pool instead of keeping dedup running all the time? It is easy to
>
> I think it's been discussed before, and the conclusion is that it
> would require bp_rewrite.

Yes, and it would possibly require more of bp_rewrite than any other use
case (i.e., a more complex bp_rewrite).

> Offline (or deferred) dedup certainly seems more attractive given the
> current real-time performance.

I'm not so sure. Or, rather, if it were there and available now, I'm sure
some people would use it and prefer it for their circumstances. Nothing
comes for free, in terms of development or operational complexity.

It seems attractive for retroactively recovering space, as a rare
operation, while maintaining snapshot integrity (and not taking everything
offline for a send|recv). But you want to be sure you can carry the cost of
that space saving.

Once your data is dedup'ed, by whatever means, access to it is the same.
You need enough memory+l2arc to indirect references via the DDT. If this is
your performance problem today, it will not be helped much by deferral.
Reads will still have the same issue, as will the deferred dedup write
workload (with more work overall). But I don't think it solves the core
overhead of freeing deduped blocks, and once that's no longer a problem for
you, neither is the synchronous dedup. Plus, if you're just on the edge,
frees can be deferred as noted previously, though that's not a very nice
place to be.

I tend to think that background/deferred dedup is a task more similar to
HSM / archival type activities, one that will involve some level of
application responsibility as well as fs-level assistance hooks. For all
the work it would involve, I'd like to get more value than just a few saved
disk blocks.

--
Dan.
2011-05-26 19:37, Edward Ned Harvey wrote:
> Hey, I got another question for ZFS developers -
>
> Given: If you enable dedup and write a bunch of data, and then
> disable dedup, the formerly written data will remain dedup'd.
>
> Given: The zdb -s command, which simulates dedup to provide dedup
> statistics without actually enabling dedup.
>
> Question: Is it possible, or can it easily become possible, to
> periodically dedup a pool instead of keeping dedup running all the
> time? It is easy to imagine some situations where idle or maintenance
> windows might be appropriate to dedup a pool, but the performance
> and/or resource requirements of keeping dedup running all the time
> might not be desirable, in some situations.

One more rationale for this idea is that with deferred dedup in place, the
DDT may be forced to hold only non-unique blocks (2+ references), and would
require less storage in RAM, disk, L2ARC, etc. - in case we agree to remake
the DDT on every offline-dedup operation.

Also, if the system uses acceptable checksums (sha256), finding matches to
enable offline dedup should be relatively easy - just read in all of the
metadata (hashes) and sort it ;)

--
Jim Klimov, CTO, JSC COS&HT
+7-903-7705859 (cellular)  mailto:jimklimov at cos.ru
CC: admin at cos.ru, jimklimov at mail.ru
() ascii ribbon campaign - against html mail
/\                        - against microsoft attachments
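To make Jim's sort-the-hashes idea concrete, here is a minimal Python
sketch of how an offline pass might collect dedup candidates from a dump of
(checksum, block pointer) pairs. The record format and the
dedup_candidates() helper are illustrative assumptions, not anything zdb or
ZFS exposes today:

    import itertools

    def dedup_candidates(records):
        """Group block records by checksum; any group with two or more
        members is a set of blocks an offline pass could collapse."""
        # records: iterable of (checksum_bytes, block_pointer) tuples,
        # e.g. harvested during a scrub-like walk of the pool metadata.
        by_cksum = sorted(records, key=lambda r: r[0])
        for cksum, group in itertools.groupby(by_cksum, key=lambda r: r[0]):
            bps = [bp for _, bp in group]
            if len(bps) >= 2:
                # Keep bps[0]; the parents of the other copies must be
                # rewritten to point at it, which is the bp_rewrite part.
                yield cksum, bps

Finding the matches is the cheap part; rewriting the parents of the
redundant copies is where the bp_rewrite discussion above comes in.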
On Fri, May 27, 2011 at 04:32:03AM +0400, Jim Klimov wrote:
> One more rationale in this idea is that with deferred dedup
> in place, the DDT may be forced to hold only non-unique
> blocks (2+ references), and would require less storage in
> RAM, disk, L2ARC, etc. - in case we agree to remake the
> DDT on every offline-dedup operation.

This is an interesting point. In this case, deferred dedup would be the
only way to get a given block hash to have 2 or more duplicates, but once
in there, further copies could be added as normal. This probably gives you
most of the (space) benefit for much less (memory) cost.

In reverse, pruning the DDT of single-instance blocks could be a useful
operation, for recovery from a case where you made a DDT too large for the
system. It would still need a complex bp_rewrite.

--
Dan.
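A rough sketch of the pruning idea, with the DDT modelled as a plain
checksum-to-entry mapping (the real DDT lives in ZAP objects on disk, so
this is purely conceptual):

    def prune_unique_entries(ddt):
        """Keep only DDT entries with two or more references.

        The dropped single-copy blocks would also need their referencing
        block pointers rewritten so they are no longer marked as dedup'd,
        which is why this still needs bp_rewrite, as noted above."""
        return {cksum: entry for cksum, entry in ddt.items()
                if entry["refcount"] >= 2}

The table that has to fit in ARC/L2ARC then shrinks to just the entries
that are actually earning their keep.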
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Thursday, May 26, 2011 8:19 PM
>
> Once your data is dedup'ed, by whatever means, access to it is the
> same. You need enough memory+l2arc to indirect references via
> DDT.

I don't think this is true. The reason you need arc+l2arc to store your
DDT is because when you perform a write, the system will need to check and
see if that block is a duplicate of an already existing block. If you
dedup once, and later disable dedup, the system won't bother checking to
see if there are duplicate blocks anymore. So the DDT won't need to be in
arc+l2arc. I should say "shouldn't."
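For reference, the write path Ed describes looks roughly like this - a
conceptual Python sketch, not the actual zio pipeline; ddt, checksum() and
allocate() are stand-ins:

    def dedup_write(ddt, data, checksum, allocate):
        """Conceptual dedup write: one DDT lookup per block written.

        With dedup disabled this function is skipped entirely and the
        block is simply allocated, which is Ed's point about writes no
        longer needing the DDT in ARC/L2ARC."""
        cksum = checksum(data)
        entry = ddt.get(cksum)
        if entry is not None:
            entry["refcount"] += 1      # duplicate: just add a reference
            return entry["dva"]         # reuse the existing on-disk address
        dva = allocate(data)            # unique: actually write the block
        ddt[cksum] = {"refcount": 1, "dva": dva}
        return dva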
Dan> ... It would still need a complex bp_rewrite.

Are you certain about that? For example, scrubbing/resilvering and fixing
corrupt blocks with non-matching checksums is a post-processing operation
which works on an existing pool and rewrites some blocks if needed. And it
works without a bp_rewrite in place...

Basically, you'd need to ensure that a single TXG would include updates to
the DDT entry for found unique blocks, freeing of the extra blocks with the
same data (checksum), and creation of "ditto" copies if a specified
threshold is exceeded - where the dittos might point to one of the already
existing extra blocks instead of it being freed.

What's more: if offline dedup were modelled (or implemented) like
scrubbing, it could be stopped at any point in its progress and then
continued (or redone from the start - but with some blocks already
deduped), and it would have a cumulative effect between invocations. This
would be acceptable for users with "bursty" writes, i.e. storing documents
on a filer during their work day. That is, you could schedule offline dedup
to run, say, between 0am and 6am, and by the time workers come to the
office, some of their storage's disk space may have been recovered and the
system is fast and responsive. The next night it continues and maybe
recovers some more space...

Also, if offline dedup could be throttled like scrubs can be throttled now,
it could run continuously in the background. Perhaps with an ARC/L2ARC
cache large enough, it wouldn't even be the huge real-time performance
degrader that dedup is now.

I can stand by Ed's findings that enabled dedup slows down write speeds on
my system approximately 10x compared to writes into non-deduped datasets;
however, a lot of the time is spent by the CPU in kernel calls (close to
50% on a dual-core) and quite a lot in disk I/O. At the moment my test
system is down, so I can't quote specific numbers, but as I remember there
were about 2-3Mb/s of writes to each of my 6 disks in raidz2 while the
end-user throughput (according to rsync) was 1.8-2Mb/s overall. Writes to
datasets without dedup could sustain 20-40Mb/s at least.

HTH,
//Jim
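A toy sketch of the stop-and-resume behaviour Jim describes, with a
scrub-style checkpoint so each maintenance window picks up where the
previous one left off. The pool walk, checkpoint store and process_block()
are all hypothetical:

    import time

    def offline_dedup_pass(blocks, process_block, load_checkpoint,
                           save_checkpoint, deadline, throttle_s=0.0):
        """Walk blocks in a fixed order, dedup as we go, and checkpoint
        progress so the next window resumes instead of starting over."""
        start = load_checkpoint()           # index reached by the last run
        for i in range(start, len(blocks)):
            if time.time() >= deadline:     # e.g. 6am: stop, keep the gains
                break
            process_block(blocks[i])        # hash, match, free duplicates
            save_checkpoint(i + 1)
            if throttle_s:
                time.sleep(throttle_s)      # crude stand-in for scrub throttling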
2011/5/27 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> I don't think this is true. The reason you need arc+l2arc to store your DDT
> is because when you perform a write, the system will need to check and see
> if that block is a duplicate of an already existing block. If you dedup
> once, and later disable dedup, the system won't bother checking to see if
> there are duplicate blocks anymore. So the DDT won't need to be in
> arc+l2arc. I should say "shouldn't."

Except when deleting deduped blocks.

--
Frank Van Damme

No part of this copyright message may be reproduced, read or seen, dead or
alive or by any means, including but not limited to telepathy without the
benevolence of the author.
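Frank's point in a nutshell: freeing a dedup'd block is a read-modify-write
of its DDT entry, so deletes keep hitting the table even after dedup has
been switched off for new writes. A conceptual sketch, using the same
hypothetical ddt shape as above:

    def dedup_free(ddt, cksum, release):
        """Conceptual free of a dedup'd block: the DDT entry has to be
        looked up and updated before any space can actually be released."""
        entry = ddt[cksum]
        entry["refcount"] -= 1
        if entry["refcount"] == 0:
            release(entry["dva"])           # last reference: free the block
            del ddt[cksum]
        # With refcount still > 0, nothing is freed; only the entry changes.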
On Fri, May 27, 2011 at 07:28:06AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Thursday, May 26, 2011 8:19 PM
> >
> > Once your data is dedup'ed, by whatever means, access to it is the
> > same. You need enough memory+l2arc to indirect references via
> > DDT.
>
> I don't think this is true.
> The reason you need arc+l2arc to store your DDT
> is because when you perform a write, the system will need to check and see
> if that block is a duplicate of an already existing block. If you dedup
> once, and later disable dedup, the system won't bother checking to see if
> there are duplicate blocks anymore. So the DDT won't need to be in
> arc+l2arc. I should say "shouldn't."

Dedup'd blocks are found via the DDT, no matter how many references to them
exist. The DDT 'owns' the actual data block, and the regular referencing
files' metadata (bp) indicates that this block is dedup'd (indirect) rather
than regular (direct).

At least that's my somewhat-rusty recollection.

--
Dan.