I've finally returned to this dedup testing project, trying to get a handle on why performance is so terrible. At the moment I'm re-running tests and monitoring memory_throttle_count, to see if maybe that's what's causing the limit. But while that's in progress and I'm still thinking...

I assume the DDT tree must be stored on disk, in the regular pool, and each entry is stored independently from each other entry, right? So whenever you're performing new unique writes, you're creating new entries in the tree, and every so often the tree will need to rebalance itself. By any chance, is DDT entry creation treated as a sync write? If so, that could be hurting me. For every new unique block written, there might be a significant number of small random writes taking place just to support the actual data write.

Anyone have any knowledge to share along these lines? Thanks...
On Wed, May 25, 2011 at 2:23 PM, Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
> I've finally returned to this dedup testing project, trying to get a handle
> on why performance is so terrible. At the moment I'm re-running tests and
> monitoring memory_throttle_count, to see if maybe that's what's causing the
> limit. But while that's in progress and I'm still thinking...
>
> I assume the DDT tree must be stored on disk, in the regular pool, and each
> entry is stored independently from each other entry, right? So whenever
> you're performing new unique writes, you're creating new entries in the
> tree, and every so often the tree will need to rebalance itself. By any
> chance, is DDT entry creation treated as a sync write? If so, that could be
> hurting me. For every new unique block written, there might be a significant
> number of small random writes taking place just to support the actual data
> write. Anyone have any knowledge to share along these lines?

The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n)) rebalancing operations. It is written asynchronously, from syncing context. That said, for each block written (unique or not), the DDT must be updated, which means reading and then writing the block that contains that dedup table entry, and the indirect blocks to get to it. With a reasonably large DDT, I would expect about 1 write to the DDT for every block written to the pool (or "written" but actually dedup'd).

--matt
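(For anyone who wants to look at their own DDT, zdb can report it; something along these lines, with "tank" as a placeholder pool name:

    # Summary: entry counts, on-disk and in-core size per entry, overall dedup ratio
    zdb -D tank

    # Same, plus a histogram of entries by reference count
    zdb -DD tank

The "in core" size per entry is a handy sanity check when estimating how much RAM the whole table will want.)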
> From: Matthew Ahrens [mailto:mahrens at delphix.com]
> Sent: Wednesday, May 25, 2011 6:50 PM
>
> The DDT is a ZAP object, so it is an on-disk hashtable, free of O(log(n))
> rebalancing operations. It is written asynchronously, from syncing
> context. That said, for each block written (unique or not), the DDT must be
> updated, which means reading and then writing the block that contains that
> dedup table entry, and the indirect blocks to get to it. With a reasonably
> large DDT, I would expect about 1 write to the DDT for every block written
> to the pool (or "written" but actually dedup'd).

So ... If the DDT were already cached completely in ARC, and I write a new unique block to a file, ideally I would hope (after write buffering, because all of this will be async) that one write will be completed to disk. It would be the aggregate of the new block plus the new DDT entry, but because of write aggregation it should literally be a single seek+latency penalty. Most likely in reality, additional writes will be necessary, to update the parent block pointers or parent DDT branches and so forth, but hopefully that's all managed well and kept to a minimum. So maybe a single new write ultimately yields a dozen times the disk access time...

I'm homing in on this more closely, but so far what I'm seeing is ... zpool iostat indicates 1000 reads taking place for every 20 writes. This is on a literally 100% idle pool, where the only activity in the system is me performing this write benchmark. The only logical explanation I see for this behavior is to conclude the DDT must not be cached in ARC. So every write yields a flurry of random reads... 50 or so...

Anyway, like I said, still exploring this. No conclusions drawn yet.
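(The read/write ratio above comes from watching the pool while the benchmark runs; roughly this, with "tank" as a placeholder pool name:

    # Per-vdev read/write operations and bandwidth, refreshed every 5 seconds
    zpool iostat -v tank 5

    # Per-device service times and queue depths from the OS side
    iostat -xn 5
)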
On Wed, May 25, 2011 at 03:50:09PM -0700, Matthew Ahrens wrote:
> That said, for each block written (unique or not), the DDT must be updated,
> which means reading and then writing the block that contains that dedup
> table entry, and the indirect blocks to get to it. With a reasonably large
> DDT, I would expect about 1 write to the DDT for every block written to the
> pool (or "written" but actually dedup'd).

That, right there, illustrates exactly why some people are disappointed wrt performance expectations from dedup. To paraphrase, and in general:

 * for write, dedup may save bandwidth but will not save write iops.
 * dedup may amplify iops with more metadata reads.
 * dedup may turn larger sequential io into smaller random io patterns.
 * many systems will be iops bound before they are bandwidth or space bound
   (and l2arc only mitigates read iops).
 * any iops benefit will only come on later reads of dedup'd data, so is
   heavily dependent on access pattern.

Assessing whether these amortised costs are worth it for you can be complex, especially when the above is not clearly understood.

To me, the thing that makes dedup most expensive in iops is the writes for update when a file (or snapshot) is deleted. These are additional iops that dedup creates, not ones that it substitutes for others in roughly equal number. This load is easily forgotten in a cursory analysis, and yet is always there in a steady state with rolling auto-snapshots.

As I've written before, I've had some success managing this load using deferred deletes and snapshot holds (rough commands below), either to spread the load or to shift it to otherwise-quiet times, as the case demanded. I'd rather not have to. :-)

--
Dan.
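(The hold/deferred-destroy pattern looks roughly like this; the pool, dataset, snapshot, and tag names are placeholders:

    # Put a hold on the snapshot so it cannot actually be freed yet
    zfs hold keep tank/fs@2011-05-26

    # Destroy with -d: the snapshot is only marked for deferred deletion
    # while any holds remain on it
    zfs destroy -d tank/fs@2011-05-26

    # Later, during a quiet period, release the hold; the real deletion,
    # and its burst of DDT updates, happens then
    zfs release keep tank/fs@2011-05-26
)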
> From: Daniel Carosone [mailto:dan at geek.com.au]
> Sent: Wednesday, May 25, 2011 10:10 PM
>
> These are additional iops that dedup creates, not ones that it substitutes
> for others in roughly equal number.

Hey ZFS developers - Of course there are many ways to possibly address these issues. Tweaking ARC prioritization and the like... Has anybody considered the possibility of making an option to always keep the DDT on a specific vdev? Presumably a nonvolatile mirror with very fast iops. It is likely a lot of people already have cache devices present... Perhaps a property could be set, which would store the DDT exclusively on that device. Naturally there are implications - you would need to recommend mirroring the device, which you can't do, so maybe we're talking about slicing the cache device... As I said, a lot of ways to address the issue.

Both the necessity to read & write the primary storage pool... That's very hurtful. And even with infinite RAM, it's going to be unavoidable for things like destroying snapshots, or anything at all you ever want to do after a reboot.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> Both the necessity to read & write the primary storage pool... That's very
> hurtful.

Actually, I'm seeing two different modes of degradation:

(1) Previously described. When I run into arc_meta_limit, in a pool of approx 1.0M to 1.5M unique blocks, I suffer ~50 reads for every new unique write. Countermeasure was easy: increase arc_meta_limit.

(2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a test file requires 10m38s to write and 2m54s to delete, but with dedup disabled it only requires 0m40s to write and 0m13s to delete exactly the same file. So ... 13x performance degradation.

zpool iostat is indicating the disks are fully utilized doing writes. No reads. During this time, it is clear the only bottleneck is write iops. There is still oodles of free mem. I am not near arc_meta_limit, nor c_max. The cpu is 99% idle. It is write iops limited. Period.

Assuming DDT maintenance is the only disk write overhead that dedup adds, I can only conclude that with dedup enabled, and a couple million unique blocks in the pool, the DDT must require substantial maintenance. In my case, something like 12 DDT writes for every 1 actual intended new unique file block write.

For the heck of it, since this machine has no other purpose at the present time, I plan to do two more tests. And I'm open to suggestions if anyone can think of anything else useful to measure:

(1) I'm currently using a recordsize of 512b, because the intended purpose of this test has been to rapidly generate a high number of new unique blocks. Now, just to eliminate the possibility that I'm shooting myself in the foot by systematically generating a worst-case scenario, I'll try to systematically generate a best-case scenario. I'll push the recordsize back up to 128k, and then repeat this test with something slightly smaller than 128k. Say, 120k. That way there should be plenty of room available for any write aggregation the system may be trying to perform.

(2) For the heck of it, why not: disable the ZIL and confirm that nothing changes. (The understanding so far is that all these writes are async, and therefore the ZIL should not be a factor. Nice to confirm this belief.) Rough commands for both tests are sketched below.
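(For reference, the knobs involved in tests (1) and (2); "tank/test" is a placeholder dataset name:

    # Test 1: go from the worst-case block size back to a friendlier one
    zfs set recordsize=128k tank/test

    # Test 2: take the ZIL out of the picture for this dataset
    # (only sensible on a scratch/benchmark dataset)
    zfs set sync=disabled tank/test

    # ... and back to the default afterwards
    zfs set sync=standard tank/test
)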
On Thu, May 26, 2011 at 07:38:05AM -0400, Edward Ned Harvey wrote:
> > From: Daniel Carosone [mailto:dan at geek.com.au]
> > Sent: Wednesday, May 25, 2011 10:10 PM
> >
> > These are additional iops that dedup creates, not ones that it
> > substitutes for others in roughly equal number.
>
> Hey ZFS developers - Of course there are many ways to possibly address these
> issues. Tweaking ARC prioritization and the like... Has anybody considered
> the possibility of making an option to always keep the DDT on a specific
> vdev? Presumably a nonvolatile mirror with very fast iops. It is likely a
> lot of people already have cache devices present... Perhaps a property could
> be set, which would store the DDT exclusively on that device. Naturally
> there are implications - you would need to recommend mirroring the device,
> which you can't do, so maybe we're talking about slicing the cache device...
> As I said, a lot of ways to address the issue.

I think l2arc persistence will just about cover that nicely, perhaps in combination with some smarter auto-tuning for arc percentages with large DDT.

The writes are async, and aren't so much a problem in themselves other than that they can get in the way of other more important things. The best thing you can do with them is spread them as widely as possible, rather than bottlenecking specific devices/channels/etc. If you have a capacity shortfall overall, either you make the other things faster in preference (zil, nv write cache, more arc for reads), or you make the whole pool faster for iops (different layout, more spindles), or you limit dedup usage within your capacity.

Another thing that can happen is that you have enough other sync writes going on that DDT writes lag behind and delay the txg close. In this case, the same solutions above apply, as does judicious use of "sync=disabled" to allow more of the writes to be async.

> Both the necessity to read & write the primary storage pool... That's very
> hurtful. And even with infinite RAM, it's going to be unavoidable for
> things like destroying snapshots, or anything at all you ever want to do
> after a reboot.

Yeah, again, persistent l2arc helps the post-reboot case.

With infinite RAM, I'm not sure I'd have much use for dedup :)

--
Dan.
On Thu, May 26, 2011 at 10:25:04AM -0400, Edward Ned Harvey wrote:
> (2) Now, in a pool with 2.4M unique blocks and dedup enabled (no verify), a
> test file requires 10m38s to write and 2m54s to delete, but with dedup
> disabled it only requires 0m40s to write and 0m13s to delete exactly the
> same file. So ... 13x performance degradation.
>
> zpool iostat is indicating the disks are fully utilized doing writes. No
> reads. During this time, it is clear the only bottleneck is write iops.
> There is still oodles of free mem. I am not near arc_meta_limit, nor c_max.
> The cpu is 99% idle. It is write iops limited. Period.

Ok.

> Assuming DDT maintenance is the only disk write overhead that dedup adds, I
> can only conclude that with dedup enabled, and a couple million unique
> blocks in the pool, the DDT must require substantial maintenance. In my
> case, something like 12 DDT writes for every 1 actual intended new unique
> file block write.

Where did that number come from? Are there actually 13x as many IOs, or is that just extrapolated from elapsed time? It won't be anything like a linear extrapolation, especially if the heads are thrashing.

Note that DDT blocks have their own allocation metadata to be updated as well. Try to get a number for actual total IOs and the scaling factor.

> For the heck of it, since this machine has no other purpose at the present
> time, I plan to do two more tests. And I'm open to suggestions if anyone
> can think of anything else useful to measure:
>
> (1) I'm currently using a recordsize of 512b, because the intended purpose
> of this test has been to rapidly generate a high number of new unique
> blocks. Now, just to eliminate the possibility that I'm shooting myself in
> the foot by systematically generating a worst-case scenario, I'll try to
> systematically generate a best-case scenario. I'll push the recordsize back
> up to 128k, and then repeat this test with something slightly smaller than
> 128k. Say, 120k. That way there should be plenty of room available for any
> write aggregation the system may be trying to perform.
>
> (2) For the heck of it, why not: disable the ZIL and confirm that nothing
> changes. (The understanding so far is that all these writes are async, and
> therefore the ZIL should not be a factor. Nice to confirm this belief.)

Good tests. See how the IO expansion factor changes with block size.

(3) Experiment with the maximum number of allowed outstanding concurrent IOs per disk (I forget the specific tunable OTTOMH; possible names below). If the load really is ~100% async write, this might well be a case where raising that figure lets the disk firmware maximise throughput without causing the latency impact that can happen otherwise (and leads to recommendations to shorten the limit in general cases).

(4) See if changing the txg sync interval to (much) longer helps. Multiple DDT entries can live in the same block, and a longer interval may allow coalescing of these writes.

--
Dan.
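(My guess at the tunables meant in (3) and (4) - both /etc/system settings on Solaris-derived systems of this vintage; the names and values are assumptions to verify against your release, not recommendations:

    * Outstanding I/Os allowed per vdev, for test (3):
        set zfs:zfs_vdev_max_pending = 35

    * Seconds between txg syncs, for test (4):
        set zfs:zfs_txg_timeout = 30

Both take effect at the next boot, or can be poked on a live system with mdb -kw.)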
On 26-05-11 13:38, Edward Ned Harvey wrote:
> Perhaps a property could be set, which would store the DDT exclusively on
> that device.

Oh yes please, let me put my DDT on an SSD.

But what if you lose it (the vdev) - would there be a way to reconstruct the DDT (which you need to be able to delete old, deduplicated files)? Let me guess - this requires tracing down all blocks and depends on an infamous feature called BPR? ;)

> Both the necessity to read & write the primary storage pool... That's very
> hurtful. And even with infinite RAM, it's going to be unavoidable for things
> like destroying snapshots, or anything at all you ever want to do after a
> reboot.

Indeed. But then again, zfs also doesn't (yet?) keep its l2arc cache between reboots. Once it does, you could flush out the entire arc to l2arc before reboot.

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Frank Van Damme
>
> On 26-05-11 13:38, Edward Ned Harvey wrote:
> > Perhaps a property could be set, which would store the DDT exclusively on
> > that device.
>
> Oh yes please, let me put my DDT on an SSD.
>
> But what if you lose it (the vdev), would there be a way to reconstruct
> the DDT (which you need to be able to delete old, deduplicated files)?
> Let me guess - this requires tracing down all blocks and depends on an
> infamous feature called BPR? ;)

How is that different from putting your DDT on a hard drive, which is what we currently do?
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Frank Van Damme
> >
> > But what if you lose it (the vdev), would there be a way to reconstruct
> > the DDT (which you need to be able to delete old, deduplicated files)?
> > Let me guess - this requires tracing down all blocks and depends on an
> > infamous feature called BPR? ;)
>
> How is that different from putting your DDT on a hard drive,
> which is what we currently do?

I think you two might be talking about somewhat different ideas of implementing such DDT storage.

One approach might be like we have now: the DDT blocks are spread in your pool consisting of several top-level vdevs and are redundantly protected by ZFS raidz or mirroring. If one of such top-level vdevs is lost, the whole pool is faulted or dead.

Another approach might be more like a dedicated extra device (or mirror/raidz of devices) like L2ARC or rather ZIL (more analogies below) - this task would need write-oriented media like SLC SSDs with a large capacity, and the throttling of the L2ARC hardware link and the potential unreliability of MLCs might make DDT storage a bad neighbor for L2ARC SSDs.

Since ZILs are usually treated as write-only devices with a low capacity requirement (i.e. 2-4Gb might be more than enough), dedicating the rest of even a 20Gb SSD to the DDT may be a good investment overall.

If the ZIL device (mirror) fails, you might need to roll back your pool a few transactions, detach the ZIL and fall back to using HDD blocks for the ZIL.

Since "zdb -s" can seemingly construct a DDT from scratch, and since for reads you still have many references to a single on-disk block (the DDT is not used for reads, right?) - you can reconstruct the DDT for either in-pool or external storage. That might take some downtime, true. But if coupled with offline dedup (as discussed in another thread) running in the background, maybe not.

One thing to think of, though: what would we do when the dedicated DDT storage overflows - write the extra entries into the HDD pool like we do now? (BTW, what do we do with a dedicated ZIL device - flush the TXG early?)

//Jim
On May 27, 2011, at 6:20 AM, Jim Klimov wrote:
> I think you two might be talking about somewhat different ideas
> of implementing such DDT storage.
>
> One approach might be like we have now: the DDT blocks are
> spread in your pool consisting of several top-level vdevs and
> are redundantly protected by ZFS raidz or mirroring. If one of
> such top-level vdevs is lost, the whole pool is faulted or dead.
>
> Another approach might be more like a dedicated extra device
> (or mirror/raidz of devices) like L2ARC or rather ZIL (more
> analogies below) - this task would need write-oriented media
> like SLC SSDs with a large capacity, and the throttling of the
> L2ARC hardware link and the potential unreliability of MLCs
> might make DDT storage a bad neighbor for L2ARC SSDs.

I filed an RFE for this about 2 years ago... I would send a URL, but Oracle shut down the OpenSolaris bugs database interface and left what is left mostly useless :-(

> Since ZILs are usually treated as write-only devices with a
> low capacity requirement (i.e. 2-4Gb might be more than
> enough), dedicating the rest of even a 20Gb SSD to the
> DDT may be a good investment overall.
>
> If the ZIL device (mirror) fails, you might need to roll back
> your pool a few transactions, detach the ZIL and fall
> back to using HDD blocks for the ZIL.
>
> Since "zdb -s" can seemingly construct a DDT from scratch,
> and since for reads you still have many references to a single
> on-disk block (the DDT is not used for reads, right?) - you can
> reconstruct the DDT for either in-pool or external storage.

Nope. If you lose the DDT, then you lose any or all deduped data. Today, the DDT is treated like metadata, which means there are always at least 2 copies in the pool.

> That might take some downtime, true. But if coupled with
> offline dedup (as discussed in another thread) running in the
> background, maybe not.
>
> One thing to think of, though: what would we do when the
> dedicated DDT storage overflows - write the extra entries
> into the HDD pool like we do now?

Other designs which use fixed-sized DDT areas suffer from that design limitation -- once it fills, they no longer dedup.

> (BTW, what do we do with a dedicated ZIL device - flush the
> TXG early?)

No, just write into the pool. Frankly, I don't see this as a problem for real world machines.
 -- richard
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> (1) I'll push the recordsize back up to 128k, and then repeat this test
> with something slightly smaller than 128k. Say, 120k.

Good news. :-) Changing the recordsize made a world of difference.

I've had this benchmark running for the last 2-3 days now, and I'm up to 3.4M blocks in the pool (as reported by zdb -D) ... Over this period, my arc_meta_used climbed from an initial 100M to 1500M. Call it 412 bytes per unique block in the pool. Not too far from what was previously said (376 bytes based on sizeof ddt_entry_t).

I am repeating a cycle: time the write of some unique data with dedup off, then time the rm of the file; time the write of the same data again with dedup on (no verify), then time the rm of the file; and finally time the write again and leave the file on disk just to hog more DDT entries. Then move on to the next unique file. Each file is about 3G in size, on a single sata disk. No records are ever repeated - every block in the pool is unique. Presently I'm up to about 500G written of a 1T drive, and the performance difference with/without dedup is 100sec vs 75sec to write the same 3G of data.

Currently my arc_meta_limit is 7680M (I tweaked it), which theoretically means I should be able to continue this up to around 19M blocks or so. Clearly much more than my little 1T drive can handle on a 128K recordsize. But I don't just want to see it succeed - I want to predict where it will fail, and confirm the hypothesis with a measured result.

So here's what I'm going to do. With arc_meta_limit at 7680M, of which 100M was consumed "naturally," that leaves me 7580 to play with. Call it 7500M. Divide by 412 bytes, and it means I'll hit a brick wall when I reach a little over 19M blocks. Which means if I set my recordsize to 32K, I'll hit that limit around 582G of disk space consumed. That is my hypothesis, and now beginning the test.
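(For anyone following along, the arc_meta numbers above come from the ARC kstats, and the limit itself is an /etc/system tunable; the values shown are just the ones from this test, not recommendations:

    # Watch metadata usage against the limit
    kstat -p zfs:0:arcstats:arc_meta_used
    kstat -p zfs:0:arcstats:arc_meta_limit

    # /etc/system entry for a 7680M limit (0x1E0000000 bytes), applied at boot
    set zfs:zfs_arc_meta_limit = 0x1E0000000
)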
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> So here's what I'm going to do. With arc_meta_limit at 7680M, of which 100M
> was consumed "naturally," that leaves me 7580 to play with. Call it 7500M.
> Divide by 412 bytes, and it means I'll hit a brick wall when I reach a
> little over 19M blocks. Which means if I set my recordsize to 32K, I'll hit
> that limit around 582G of disk space consumed. That is my hypothesis, and
> now beginning the test.

Well, this is interesting. With 7580MB theoretically available for DDT in ARC, the expectation was that 19M DDT entries would finally max out the ARC and then I'd jump off a performance cliff and start seeing a bunch of pool reads killing my write performance. In reality, what I saw was:

 * Up to a million blocks, the performance difference with/without dedup was
   basically negligible. Write time with dedup = 1x write time without dedup.
 * After a million, the dedup write time consistently reached 2x longer than
   the native write time. This happened when my ARC became full of user data
   (not metadata).
 * As the # of unique blocks in the pool increased, the dedup write time
   gradually deviated further from the non-dedup write time: 2x, 3x, 4x. I got
   a consistent 4x longer write time with dedup enabled, after the pool
   reached 22.5M blocks.
 * And then it jumped off a cliff. When I got to 24M blocks, it was the last
   datapoint able to be collected: 28x slower write with dedup (4966 sec to
   write 3G, as compared to 178 sec), and for the first time, a nonzero rm
   time. All the way up till now, even with dedup, the rm time was zero. But
   now it was 72 sec.
 * I waited another 6 hours, and never got another data point. So I found the
   limit where the pool becomes unusably slow.

At a cursory look, you might say this supported the hypothesis. You might say "24M compared to 19M, that's not too far off. This could be accounted for by using the 376-byte size of ddt_entry_t, instead of the 412-byte size apparently measured... This would adjust the hypothesis to 21.1M blocks." But I don't think that's quite fair. Because my arc_meta_used never got above 5,159. And I never saw the massive read overload that was predicted to be the cause of failure. In fact, starting from 0.4M to 0.5M blocks (early, early, early on), from that point onward I always had 40-50 reads for every 250 writes, right to the bitter end. And my ARC is full of user data, not metadata.

So the conclusions I'm drawing are:

(1) If you don't tweak arc_meta_limit, and you want to enable dedup, you're toast. But if you do tweak arc_meta_limit, you might reasonably expect dedup to perform 3x to 4x slower on unique data... And based on results that I haven't talked about yet here, dedup performs 3x to 4x faster on duplicate data. So if you have 50% or higher duplicate data (dedup ratio 2x or higher), and you have plenty of memory and tweak it, then your performance with dedup could be comparable to, or even faster than, running without dedup. Of course, depending on your data patterns and usage patterns. YMMV.

(2) The above is pretty much the best you can do, if your server is going to be a "normal" server, handling both reads & writes. Because the data and the metadata are both stored in the ARC, the data has a tendency to push the metadata out. But in a special use case - Suppose you only care about write performance and saving disk space. For example, suppose you're the destination server of a backup policy.
You only do writes, so you don't care about keeping data in cache. You want to enable dedup to save cost on backup disks. You only care about keeping metadata in ARC. If you set primarycache=metadata .... I'll go test this now. The hypothesis is that my arc_meta_used should actually climb up to the arc_meta_limit before I start hitting any disk reads, so my write performance with/without dedup should be pretty much equal up to that point. I'm sacrificing the potential read benefit of caching data in ARC, in order to hopefully gain write performance - so write performance can be just as good with dedup enabled or disabled. In fact, if there's much duplicate data, the dedup write performance in this case should be significantly better than without dedup.
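(The property in question, with a placeholder dataset name:

    # Cache only metadata (including the DDT) in ARC for this dataset;
    # file data will no longer be cached in RAM
    zfs set primarycache=metadata tank/backup

    # Revert to the default of caching both data and metadata
    zfs set primarycache=all tank/backup
)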
2011/6/1 Edward Ned Harvey <opensolarisisdeadlongliveopensolaris at nedharvey.com>:
> (2) The above is pretty much the best you can do, if your server is going
> to be a "normal" server, handling both reads & writes. Because the data and
> the metadata are both stored in the ARC, the data has a tendency to push
> the metadata out. But in a special use case - Suppose you only care about
> write performance and saving disk space. For example, suppose you're the
> destination server of a backup policy. You only do writes, so you don't
> care about keeping data in cache. You want to enable dedup to save cost on
> backup disks. You only care about keeping metadata in ARC. If you set
> primarycache=metadata .... I'll go test this now. The hypothesis is that
> my arc_meta_used should actually climb up to the arc_meta_limit before I
> start hitting any disk reads, so my write performance with/without dedup
> should be pretty much equal up to that point. I'm sacrificing the potential
> read benefit of caching data in ARC, in order to hopefully gain write
> performance - so write performance can be just as good with dedup enabled
> or disabled. In fact, if there's much duplicate data, the dedup write
> performance in this case should be significantly better than without dedup.

I guess this is pretty much why I have primarycache=metadata and

    set zfs:zfs_arc_meta_limit=0x100000000
    set zfs:zfs_arc_min=0xC0000000

in /etc/system. And the ARC size on this box tends to drop far below arc_min after a few days, notwithstanding the fact it's supposed to be a hard limit. I call for an arc_data_max setting :)

--
Frank Van Damme
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
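(To observe the behavior described above, the live ARC size can be compared against its floor; both are plain arcstats kstats:

    kstat -p zfs:0:arcstats:size
    kstat -p zfs:0:arcstats:c_min
)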