I've finally returned to this dedup testing project, trying to get a handle on why performance is so terrible. At the moment I'm re-running tests and monitoring memory_throttle_count, to see if maybe that's what's causing the limit. But while that's in progress and I'm still thinking...

I assume the DDT tree must be stored on disk, in the regular pool, and each entry is stored independently from each other entry, right? So whenever you're performing new unique writes, you're creating new entries in the tree, and every so often the tree will need to rebalance itself. By any chance, is DDT entry creation treated as a sync write? If so, that could be hurting me. For every new unique block written, there might be a significant number of small random writes taking place just to support the actual data write.

Anyone have any knowledge to share along these lines? Thanks...
On Wed, May 25, 2011 at 2:23 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I've finally returned to this dedup testing project, trying to get a handle
> on why performance is so terrible. At the moment I'm re-running tests and
> monitoring memory_throttle_count, to see if maybe that's what's causing the
> limit. But while that's in progress and I'm still thinking...
>
> I assume the DDT tree must be stored on disk, in the regular pool, and each
> entry is stored independently from each other entry, right? So whenever
> you're performing new unique writes, you're creating new entries in the
> tree, and every so often the tree will need to rebalance itself. By any
> chance, is DDT entry creation treated as a sync write? If so, that could be
> hurting me. For every new unique block written, there might be a significant
> number of small random writes taking place just to support the actual data
> write. Anyone have any knowledge to share along these lines?

The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n)) rebalancing operations. It is written asynchronously, from syncing context. That said, for each block written (unique or not), the DDT must be updated, which means reading and then writing the block that contains that dedup table entry, and the indirect blocks to get to it. With a reasonably large DDT, I would expect about 1 write to the DDT for every block written to the pool (or "written" but actually dedup'd).

--matt
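(For anyone who wants to look at their own DDT, zdb can report it; something along these lines, with "tank" as a placeholder pool name:

    # Summary: entry counts, on-disk and in-core size per entry, overall dedup ratio
    zdb -D tank

    # Same, plus a histogram of entries by reference count
    zdb -DD tank

The "in core" size per entry is a handy sanity check when estimating how much RAM the whole table will want.)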
> From: Matthew Ahrens [mailto:mahrens at delphix.com]
> Sent: Wednesday, May 25, 2011 6:50 PM
>
> The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n))
> rebalancing operations. It is written asynchronously, from syncing
> context. That said, for each block written (unique or not), the DDT must be
> updated, which means reading and then writing the block that contains that
> dedup table entry, and the indirect blocks to get to it. With a reasonably
> large DDT, I would expect about 1 write to the DDT for every block written
> to the pool (or "written" but actually dedup'd).

So ... If the DDT were already cached completely in ARC, and I write a new unique block to a file, ideally I would hope (after write buffering, because all of this will be async) that one write will be completed to disk. It would be the aggregate of the new block plus the new DDT entry, but because of write aggregation it should literally be a single seek+latency penalty. Most likely in reality, additional writes will be necessary, to update the parent block pointers or parent DDT branches and so forth, but hopefully that's all managed well and kept to a minimum. So maybe a single new write ultimately yields a dozen times the disk access time...

I'm homing in on this more closely, but so far what I'm seeing is ... zpool iostat indicates 1000 reads taking place for every 20 writes. This is on a literally 100% idle pool, where the only activity in the system is me performing this write benchmark. The only logical explanation I see for this behavior is to conclude the DDT must not be cached in ARC. So every write yields a flurry of random reads... 50 or so...

Anyway, like I said, still exploring this. No conclusions drawn yet.
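(The read/write ratio above comes from watching the pool while the benchmark runs; roughly this, with "tank" as a placeholder pool name:

    # Per-vdev read/write operations and bandwidth, refreshed every 5 seconds
    zpool iostat -v tank 5

    # Per-device service times and queue depths from the OS side
    iostat -xn 5
)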
On Wed, May 25, 2011 at 03:50:09PM -0700, Matthew Ahrens wrote:
> That said, for each block written (unique or not), the DDT must be updated,
> which means reading and then writing the block that contains that dedup
> table entry, and the indirect blocks to get to it. With a reasonably large
> DDT, I would expect about 1 write to the DDT for every block written to the
> pool (or "written" but actually dedup'd).

That, right there, illustrates exactly why some people are disappointed wrt performance expectations from dedup. To paraphrase, and in general:

 * for write, dedup may save bandwidth but will not save write iops.
 * dedup may amplify iops with more metadata reads.
 * dedup may turn larger sequential io into smaller random io patterns.
 * many systems will be iops bound before they are bandwidth or space bound
   (and l2arc only mitigates read iops).
 * any iops benefit will only come on later reads of dedup'd data, so is
   heavily dependent on access pattern.

Assessing whether these amortised costs are worth it for you can be complex, especially when the above is not clearly understood.

To me, the thing that makes dedup most expensive in iops is the writes for update when a file (or snapshot) is deleted. These are additional iops that dedup creates, not ones that it substitutes for others in roughly equal number. This load is easily forgotten in a cursory analysis, and yet is always there in a steady state with rolling auto-snapshots.

As I've written before, I've had some success managing this load using deferred deletes and snapshot holds (rough commands below), either to spread the load or to shift it to otherwise-quiet times, as the case demanded. I'd rather not have to. :-)

--
Dan.
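(The hold/deferred-destroy pattern looks roughly like this; the pool, dataset, snapshot, and tag names are placeholders:

    # Put a hold on the snapshot so it cannot actually be freed yet
    zfs hold keep tank/fs@2011-05-26

    # Destroy with -d: the snapshot is only marked for deferred deletion
    # while any holds remain on it
    zfs destroy -d tank/fs@2011-05-26

    # Later, during a quiet period, release the hold; the real deletion,
    # and its burst of DDT updates, happens then
    zfs release keep tank/fs@2011-05-26
)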
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Wednesday, May 25, 2011 10:10 PM
>
> These are additional iops that dedup creates, not ones that it substitutes
> for others in roughly equal number.

Hey ZFS developers - Of course there are many ways to possibly address these issues. Tweaking ARC prioritization and the like... Has anybody considered the possibility of making an option to always keep the DDT on a specific vdev? Presumably a nonvolatile mirror with very fast iops. It is likely a lot of people already have cache devices present... Perhaps a property could be set, which would store the DDT exclusively on that device. Naturally there are implications - you would need to recommend mirroring the device, which you can't do, so maybe we're talking about slicing the cache device... As I said, a lot of ways to address the issue.

Both the necessity to read & write the primary storage pool... That's very hurtful. And even with infinite RAM, it's going to be unavoidable for things like destroying snapshots, or anything at all you ever want to do after a reboot.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Both the necessity to read & write the primary storage pool... That's very
> hurtful.

Actually, I'm seeing two different modes of degradation:

(1) Previously described. When I run into arc_meta_limit, in a pool of approx 1.0M to 1.5M unique blocks, I suffer ~50 reads for every new unique write. Countermeasure was easy: increase arc_meta_limit.

(2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a test file requires 10m38s to write and 2m54s to delete, but with dedup disabled it only requires 0m40s to write and 0m13s to delete exactly the same file. So ... 13x performance degradation.

zpool iostat is indicating the disks are fully utilized doing writes. No reads. During this time, it is clear the only bottleneck is write iops. There is still oodles of free mem. I am not near arc_meta_limit, nor c_max. The cpu is 99% idle. It is write iops limited. Period.

Assuming DDT maintenance is the only disk write overhead that dedup adds, I can only conclude that with dedup enabled, and a couple million unique blocks in the pool, the DDT must require substantial maintenance. In my case, something like 12 DDT writes for every 1 actual intended new unique file block write.

For the heck of it, since this machine has no other purpose at the present time, I plan to do two more tests. And I'm open to suggestions if anyone can think of anything else useful to measure:

(1) I'm currently using a recordsize of 512b, because the intended purpose of this test has been to rapidly generate a high number of new unique blocks. Now, just to eliminate the possibility that I'm shooting myself in the foot by systematically generating a worst-case scenario, I'll try to systematically generate a best-case scenario. I'll push the recordsize back up to 128k, and then repeat this test with something slightly smaller than 128k. Say, 120k. That way there should be plenty of room available for any write aggregation the system may be trying to perform.

(2) For the heck of it, why not: disable the ZIL and confirm that nothing changes. (The understanding so far is that all these writes are async, and therefore the ZIL should not be a factor. Nice to confirm this belief.) Rough commands for both tests are sketched below.
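(For reference, the knobs involved in tests (1) and (2); "tank/test" is a placeholder dataset name:

    # Test 1: go from the worst-case block size back to a friendlier one
    zfs set recordsize=128k tank/test

    # Test 2: take the ZIL out of the picture for this dataset
    # (only sensible on a scratch/benchmark dataset)
    zfs set sync=disabled tank/test

    # ... and back to the default afterwards
    zfs set sync=standard tank/test
)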
On Thu, May 26, 2011 at 07:38:05AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Wednesday, May 25, 2011 10:10 PM
> >
> > These are additional iops that dedup creates, not ones that it
> > substitutes for others in roughly equal number.
>
> Hey ZFS developers - Of course there are many ways to possibly address these
> issues. Tweaking ARC prioritization and the like... Has anybody considered
> the possibility of making an option to always keep the DDT on a specific
> vdev? Presumably a nonvolatile mirror with very fast iops. It is likely a
> lot of people already have cache devices present... Perhaps a property could
> be set, which would store the DDT exclusively on that device. Naturally
> there are implications - you would need to recommend mirroring the device,
> which you can't do, so maybe we're talking about slicing the cache device...
> As I said, a lot of ways to address the issue.

I think l2arc persistence will just about cover that nicely, perhaps in combination with some smarter auto-tuning for arc percentages with large DDT.

The writes are async, and aren't so much a problem in themselves other than that they can get in the way of other more important things. The best thing you can do with them is spread them as widely as possible, rather than bottlenecking specific devices/channels/etc. If you have a capacity shortfall overall, either you make the other things faster in preference (zil, nv write cache, more arc for reads), or you make the whole pool faster for iops (different layout, more spindles), or you limit dedup usage within your capacity.

Another thing that can happen is that you have enough other sync writes going on that DDT writes lag behind and delay the txg close. In this case, the same solutions above apply, as does judicious use of "sync=disabled" to allow more of the writes to be async.

> Both the necessity to read & write the primary storage pool... That's very
> hurtful. And even with infinite RAM, it's going to be unavoidable for
> things like destroying snapshots, or anything at all you ever want to do
> after a reboot.

Yeah, again, persistent l2arc helps the post-reboot case.

With infinite RAM, I'm not sure I'd have much use for dedup :)

--
Dan.
On Thu, May 26, 2011 at 10:25:04AM -0400, Edward Ned Harvey wrote:
> (2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a
> test file requires 10m38s to write and 2m54s to delete, but with dedup
> disabled it only requires 0m40s to write and 0m13s to delete exactly the
> same file. So ... 13x performance degradation.
>
> zpool iostat is indicating the disks are fully utilized doing writes. No
> reads. During this time, it is clear the only bottleneck is write iops.
> There is still oodles of free mem. I am not near arc_meta_limit, nor c_max.
> The cpu is 99% idle. It is write iops limited. Period.

Ok.

> Assuming DDT maintenance is the only disk write overhead that dedup adds, I
> can only conclude that with dedup enabled, and a couple million unique
> blocks in the pool, the DDT must require substantial maintenance. In my
> case, something like 12 DDT writes for every 1 actual intended new unique
> file block write.

Where did that number come from? Are there actually 13x as many IOs, or is that just extrapolated from elapsed time? It won't be anything like a linear extrapolation, especially if the heads are thrashing.

Note that DDT blocks have their own allocation metadata to be updated as well. Try to get a number for actual total IOs and the scaling factor.

> For the heck of it, since this machine has no other purpose at the present
> time, I plan to do two more tests. And I'm open to suggestions if anyone
> can think of anything else useful to measure:
>
> (1) I'm currently using a recordsize of 512b, because the intended purpose
> of this test has been to rapidly generate a high number of new unique
> blocks. Now, just to eliminate the possibility that I'm shooting myself in
> the foot by systematically generating a worst-case scenario, I'll try to
> systematically generate a best-case scenario. I'll push the recordsize back
> up to 128k, and then repeat this test with something slightly smaller than
> 128k. Say, 120k. That way there should be plenty of room available for any
> write aggregation the system may be trying to perform.
>
> (2) For the heck of it, why not: disable the ZIL and confirm that nothing
> changes. (The understanding so far is that all these writes are async, and
> therefore the ZIL should not be a factor. Nice to confirm this belief.)

Good tests. See how the IO expansion factor changes with block size.

(3) Experiment with the maximum number of allowed outstanding concurrent IOs per disk (I forget the specific tunable OTTOMH; possible names below). If the load really is ~100% async write, this might well be a case where raising that figure lets the disk firmware maximise throughput without causing the latency impact that can happen otherwise (and leads to recommendations to shorten the limit in general cases).

(4) See if changing the txg sync interval to (much) longer helps. Multiple DDT entries can live in the same block, and a longer interval may allow coalescing of these writes.

--
Dan.
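(My guess at the tunables meant in (3) and (4) - both /etc/system settings on Solaris-derived systems of this vintage; the names and values are assumptions to verify against your release, not recommendations:

    * Outstanding I/Os allowed per vdev, for test (3):
        set zfs:zfs_vdev_max_pending = 35

    * Seconds between txg syncs, for test (4):
        set zfs:zfs_txg_timeout = 30

Both take effect at the next boot, or can be poked on a live system with mdb -kw.)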
On 26-05-11 13:38, Edward Ned Harvey wrote:
> Perhaps a property could be set, which would store the DDT exclusively on
> that device.

Oh yes please, let me put my DDT on an SSD.

But what if you lose it (the vdev) - would there be a way to reconstruct the DDT (which you need to be able to delete old, deduplicated files)? Let me guess - this requires tracing down all blocks and depends on an infamous feature called BPR? ;)

> Both the necessity to read & write the primary storage pool... That's very
> hurtful. And even with infinite RAM, it's going to be unavoidable for things
> like destroying snapshots, or anything at all you ever want to do after a
> reboot.

Indeed. But then again, zfs also doesn't (yet?) keep its l2arc cache between reboots. Once it does, you could flush out the entire arc to l2arc before reboot.

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Frank Van Damme
>
> On 26-05-11 13:38, Edward Ned Harvey wrote:
> > Perhaps a property could be set, which would store the DDT exclusively on
> > that device.
>
> Oh yes please, let me put my DDT on an SSD.
>
> But what if you lose it (the vdev), would there be a way to reconstruct
> the DDT (which you need to be able to delete old, deduplicated files)?
> Let me guess - this requires tracing down all blocks and depends on an
> infamous feature called BPR? ;)

How is that different from putting your DDT on a hard drive, which is what we currently do?
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Frank Van Damme
> >
> > But what if you lose it (the vdev), would there be a way to reconstruct
> > the DDT (which you need to be able to delete old, deduplicated files)?
> > Let me guess - this requires tracing down all blocks and depends on an
> > infamous feature called BPR? ;)
>
> How is that different from putting your DDT on a hard drive,
> which is what we currently do?

I think you two might be talking about somewhat different ideas of implementing such DDT storage.

One approach might be like we have now: the DDT blocks are spread in your pool consisting of several top-level vdevs and are redundantly protected by ZFS raidz or mirroring. If one of such top-level vdevs is lost, the whole pool is faulted or dead.

Another approach might be more like a dedicated extra device (or mirror/raidz of devices) like L2ARC or rather ZIL (more analogies below) - this task would need write-oriented media like SLC SSDs with a large capacity, and the throttling of the L2ARC hardware link and the potential unreliability of MLCs might make DDT storage a bad neighbor for L2ARC SSDs.

Since ZILs are usually treated as write-only devices with a low capacity requirement (i.e. 2-4Gb might be more than enough), dedicating the rest of even a 20Gb SSD to the DDT may be a good investment overall.

If the ZIL device (mirror) fails, you might need to roll back your pool a few transactions, detach the ZIL and fall back to using HDD blocks for the ZIL.

Since "zdb -s" can seemingly construct a DDT from scratch, and since for reads you still have many references to a single on-disk block (the DDT is not used for reads, right?) - you can reconstruct the DDT for either in-pool or external storage. That might take some downtime, true. But if coupled with offline dedup (as discussed in another thread) running in the background, maybe not.

One thing to think of, though: what would we do when the dedicated DDT storage overflows - write the extra entries into the HDD pool like we do now? (BTW, what do we do with a dedicated ZIL device - flush the TXG early?)

//Jim
On May 27, 2011, at 6:20 AM, Jim Klimov wrote:
> I think you two might be talking about somewhat different ideas
> of implementing such DDT storage.
>
> One approach might be like we have now: the DDT blocks are
> spread in your pool consisting of several top-level vdevs and
> are redundantly protected by ZFS raidz or mirroring. If one of
> such top-level vdevs is lost, the whole pool is faulted or dead.
>
> Another approach might be more like a dedicated extra device
> (or mirror/raidz of devices) like L2ARC or rather ZIL (more
> analogies below) - this task would need write-oriented media
> like SLC SSDs with a large capacity, and the throttling of the
> L2ARC hardware link and the potential unreliability of MLCs
> might make DDT storage a bad neighbor for L2ARC SSDs.

I filed an RFE for this about 2 years ago... I would send a URL, but Oracle shut down the OpenSolaris bugs database interface and left what is left mostly useless :-(

> Since ZILs are usually treated as write-only devices with a
> low capacity requirement (i.e. 2-4Gb might be more than
> enough), dedicating the rest of even a 20Gb SSD to the
> DDT may be a good investment overall.
>
> If the ZIL device (mirror) fails, you might need to roll back
> your pool a few transactions, detach the ZIL and fall
> back to using HDD blocks for the ZIL.
>
> Since "zdb -s" can seemingly construct a DDT from scratch,
> and since for reads you still have many references to a single
> on-disk block (the DDT is not used for reads, right?) - you can
> reconstruct the DDT for either in-pool or external storage.

Nope. If you lose the DDT, then you lose any or all deduped data. Today, the DDT is treated like metadata, which means there are always at least 2 copies in the pool.

> That might take some downtime, true. But if coupled with
> offline dedup (as discussed in another thread) running in the
> background, maybe not.
>
> One thing to think of, though: what would we do when the
> dedicated DDT storage overflows - write the extra entries
> into the HDD pool like we do now?

Other designs which use fixed-sized DDT areas suffer from that design limitation -- once it fills, they no longer dedup.

> (BTW, what do we do with a dedicated ZIL device - flush the
> TXG early?)

No, just write into the pool. Frankly, I don't see this as a problem for real world machines.
 -- richard
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> (1) I'll push the recordsize back up to 128k, and then repeat this test
> with something slightly smaller than 128k. Say, 120k.

Good news. :-) Changing the recordsize made a world of difference.

I've had this benchmark running for the last 2-3 days now, and I'm up to 3.4M blocks in the pool (as reported by zdb -D) ... Over this period, my arc_meta_used climbed from an initial 100M to 1500M. Call it 412 bytes per unique block in the pool. Not too far from what was previously said (376 bytes based on sizeof ddt_entry_t).

I am repeating a cycle: time the write of some unique data with dedup off, then time the rm of the file; time the write of the same data again with dedup on (no verify), then time the rm of the file; and finally time the write again and leave the file on disk just to hog more DDT entries. Then move on to the next unique file. Each file is about 3G in size, on a single sata disk. No records are ever repeated - every block in the pool is unique. Presently I'm up to about 500G written of a 1T drive, and the performance difference with/without dedup is 100sec vs 75sec to write the same 3G of data.

Currently my arc_meta_limit is 7680M (I tweaked it), which theoretically means I should be able to continue this up to around 19M blocks or so. Clearly much more than my little 1T drive can handle on a 128K recordsize. But I don't just want to see it succeed - I want to predict where it will fail, and confirm the hypothesis with a measured result.

So here's what I'm going to do. With arc_meta_limit at 7680M, of which 100M was consumed "naturally," that leaves me 7580 to play with. Call it 7500M. Divide by 412 bytes, and it means I'll hit a brick wall when I reach a little over 19M blocks. Which means if I set my recordsize to 32K, I'll hit that limit around 582G of disk space consumed. That is my hypothesis, and now beginning the test.
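(For anyone following along, the arc_meta numbers above come from the ARC kstats, and the limit itself is an /etc/system tunable; the values shown are just the ones from this test, not recommendations:

    # Watch metadata usage against the limit
    kstat -p zfs:0:arcstats:arc_meta_used
    kstat -p zfs:0:arcstats:arc_meta_limit

    # /etc/system entry for a 7680M limit (0x1E0000000 bytes), applied at boot
    set zfs:zfs_arc_meta_limit = 0x1E0000000
)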
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> So here's what I'm going to do. With arc_meta_limit at 7680M, of which 100M
> was consumed "naturally," that leaves me 7580 to play with. Call it 7500M.
> Divide by 412 bytes, and it means I'll hit a brick wall when I reach a
> little over 19M blocks. Which means if I set my recordsize to 32K, I'll hit
> that limit around 582G of disk space consumed. That is my hypothesis, and
> now beginning the test.

Well, this is interesting. With 7580MB theoretically available for DDT in ARC, the expectation was that 19M DDT entries would finally max out the ARC and then I'd jump off a performance cliff and start seeing a bunch of pool reads killing my write performance. In reality, what I saw was:

 * Up to a million blocks, the performance difference with/without dedup was
   basically negligible. Write time with dedup = 1x write time without dedup.
 * After a million, the dedup write time consistently reached 2x longer than
   the native write time. This happened when my ARC became full of user data
   (not metadata).
 * As the # of unique blocks in the pool increased, the dedup write time
   gradually deviated further from the non-dedup write time: 2x, 3x, 4x. I got
   a consistent 4x longer write time with dedup enabled, after the pool
   reached 22.5M blocks.
 * And then it jumped off a cliff. When I got to 24M blocks, it was the last
   datapoint able to be collected: 28x slower write with dedup (4966 sec to
   write 3G, as compared to 178 sec), and for the first time, a nonzero rm
   time. All the way up till now, even with dedup, the rm time was zero. But
   now it was 72 sec.
 * I waited another 6 hours, and never got another data point. So I found the
   limit where the pool becomes unusably slow.

At a cursory look, you might say this supported the hypothesis. You might say "24M compared to 19M, that's not too far off. This could be accounted for by using the 376-byte size of ddt_entry_t, instead of the 412-byte size apparently measured... This would adjust the hypothesis to 21.1M blocks." But I don't think that's quite fair. Because my arc_meta_used never got above 5,159. And I never saw the massive read overload that was predicted to be the cause of failure. In fact, starting from 0.4M to 0.5M blocks (early, early, early on), from that point onward I always had 40-50 reads for every 250 writes, right to the bitter end. And my ARC is full of user data, not metadata.

So the conclusions I'm drawing are:

(1) If you don't tweak arc_meta_limit, and you want to enable dedup, you're toast. But if you do tweak arc_meta_limit, you might reasonably expect dedup to perform 3x to 4x slower on unique data... And based on results that I haven't talked about yet here, dedup performs 3x to 4x faster on duplicate data. So if you have 50% or higher duplicate data (dedup ratio 2x or higher), and you have plenty of memory and tweak it, then your performance with dedup could be comparable to, or even faster than, running without dedup. Of course, depending on your data patterns and usage patterns. YMMV.

(2) The above is pretty much the best you can do, if your server is going to be a "normal" server, handling both reads & writes. Because the data and the metadata are both stored in the ARC, the data has a tendency to push the metadata out. But in a special use case - Suppose you only care about write performance and saving disk space. For example, suppose you're the destination server of a backup policy.
You only do writes, so you don't care about keeping data in cache. You want to enable dedup to save cost on backup disks. You only care about keeping metadata in ARC. If you set primarycache=metadata .... I'll go test this now. The hypothesis is that my arc_meta_used should actually climb up to the arc_meta_limit before I start hitting any disk reads, so my write performance with/without dedup should be pretty much equal up to that point. I'm sacrificing the potential read benefit of caching data in ARC, in order to hopefully gain write performance - so write performance can be just as good with dedup enabled or disabled. In fact, if there's much duplicate data, the dedup write performance in this case should be significantly better than without dedup.
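(The property in question, with a placeholder dataset name:

    # Cache only metadata (including the DDT) in ARC for this dataset;
    # file data will no longer be cached in RAM
    zfs set primarycache=metadata tank/backup

    # Revert to the default of caching both data and metadata
    zfs set primarycache=all tank/backup
)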
2011/6/1 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> (2) The above is pretty much the best you can do, if your server is going
> to be a "normal" server, handling both reads & writes. Because the data and
> the metadata are both stored in the ARC, the data has a tendency to push
> the metadata out. But in a special use case - Suppose you only care about
> write performance and saving disk space. For example, suppose you're the
> destination server of a backup policy. You only do writes, so you don't
> care about keeping data in cache. You want to enable dedup to save cost on
> backup disks. You only care about keeping metadata in ARC. If you set
> primarycache=metadata .... I'll go test this now. The hypothesis is that
> my arc_meta_used should actually climb up to the arc_meta_limit before I
> start hitting any disk reads, so my write performance with/without dedup
> should be pretty much equal up to that point. I'm sacrificing the potential
> read benefit of caching data in ARC, in order to hopefully gain write
> performance - so write performance can be just as good with dedup enabled
> or disabled. In fact, if there's much duplicate data, the dedup write
> performance in this case should be significantly better than without dedup.

I guess this is pretty much why I have primarycache=metadata and

    set zfs:zfs_arc_meta_limit=0x100000000
    set zfs:zfs_arc_min=0xC0000000

in /etc/system. And the ARC size on this box tends to drop far below arc_min after a few days, notwithstanding the fact it's supposed to be a hard limit. I call for an arc_data_max setting :)

--
Frank Van Damme
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
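(To observe the behavior described above, the live ARC size can be compared against its floor; both are plain arcstats kstats:

    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:c_min
)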