Edward Ned Harvey
2011-Jul-09 16:04 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
There were a lot of useful details put into the thread "Summary: Dedup and L2ARC memory requirements". Please refer to that thread as necessary... After much discussion leading up to that thread, I thought I had enough understanding to make dedup useful, but then in practice, it didn't work out. Now I've done a lot more work on it, reduced it all to practice, and I finally feel I can draw up conclusions that are actually useful:

I am testing on a Sun Oracle server, X4270, 1 Xeon 4-core 2.4GHz, 24G RAM, 12 disks ea 2T SAS 7.2krpm. The Solaris 11 Express snv_151a OS is installed on a single disk. The remaining 11 disks are all striped into a single 20 TB pool (no redundancy). Obviously not the way you would configure for production; the point is to get maximum usable size and max performance for testing purposes, so I can actually find the limits in this lifetime.

With and without dedup, the read and write performance characteristics on duplicate and unique data are completely different. That's a lot of variables. So here's how I'm going to break it down: performance gain of dedup versus performance loss of dedup.

--- Performance gain:

Unfortunately there was only one area where I found any performance gain. When you read back duplicate data that was previously written with dedup, you get a lot more cache hits, and as a result, the reads go faster. Unfortunately these gains are diminished... I don't know by what... But you only have about 2x to 4x performance gain reading previously dedup'd data, as compared to reading the same data which was never dedup'd. Even when repeatedly reading the same file which is 100% duplicate data (created by dd from /dev/zero), so all the data is 100% in cache... I still see only 2x to 4x performance gain with dedup.

--- Performance loss:

The first conclusion to draw is: for extremely small pools (say, a few GB or so), writing with or without dedup performs exactly the same. As you grow the unique blocks in the pool, the write performance with dedup starts to deviate from the write performance without dedup. It quickly reaches 4x, 6x, 10x slower with dedup... but this write performance degradation appears to be logarithmic. I reached 8x write performance degradation around 2.4M blocks (290G used), but I never exceeded 11x write performance degradation even when I got up to 14T used (123M blocks).

The second conclusion is: out of the box, dedup is basically useless. Thanks to arc_meta_limit being pathetically small by default (3840M in my 24G system), I ran into a write performance brick wall around 30M unique blocks in the system ~= 3.6T unique data in the pool. When I say "write performance brick wall," I mean that up until that point, dedup writing was about 8x slower than writing without dedup, and after that point, the write performance difference grew exponentially. (Maybe it's not mathematically exponential, but the numbers look exponential to my naked eye.) I left it running for about 19 hours, and I never got beyond 5.8T written in the system.

Fortunately, it's really easy to tweak arc_meta_limit. So in the second test, that's what I did: I set the arc_meta_limit so high it would never be reached. In this configuration, the previously described logarithmic write performance degradation continued much higher. In other words, dedup write performance was pretty bad, but there was no "brick wall" as previously described.
Basically I kept writing at this rate till the pool was 13.5T full (113M blocks), and the whole time, dedup write performance was approx 10x slower than writing without dedup. At this point, my arc_meta_used reached 15,500M and it would not grow any more, so I reached a "softer" brick wall. I could only conclude that the data being cached in ARC was pushing the metadata out of ARC. But that's only a guess.

So the 3rd test was to leave the arc_meta_limit at maximum value, and set primarycache to metadata only. Naturally if you use this configuration in production, you're likely to have poor read performance, because you're guaranteed to never have a data cache hit... But it could still be a useful configuration, for example, if you are using dedup on a write-only backup server.

In this configuration, write performance was much better than the other configurations. I reached 3x slower dedup write performance almost immediately. 4x occurred around 47M blocks (5.6T used). 5x occurred around 88M blocks (10.6T used). It maintained 6x until around 142M blocks (17T used) and 15,732M arc_meta_used. At this point I hit 90% full, and the whole system basically fell apart, so I disregard all the results that came later. Based on the numbers I'm seeing, I have every reason to believe the system could have continued writing with dedup merely 6x slower than writing without dedup, if I had not run out of disk space. At least theoretically this should continue until I cannot fit the metadata in RAM anymore, and then I'll hit a brick wall... But the only way I can measure that is to go remove RAM from my system and repeat the test. I don't think I'll bother.

It has been mentioned before that you suffer a big performance penalty when you delete things. This is very, very true. Snapshot destroy, or even rm a file, and the time to completion is on the same order as the time it took to initially create all that data. This is a really big weak point. It might take several hours to perform the snapshot destroy of the oldest daily snapshot, and naturally that would be likely to occur on a daily basis, every midnight.
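For reference, the two knobs used in tests 2 and 3 boil down to the following. (The mdb variable name is what I see on snv_151a and may differ on other builds; the limit value is just an example, and "tank" stands in for your pool name.)

   # raise arc_meta_limit on the live system (example: ~20G), then verify
   echo "arc_meta_limit/Z 0x500000000" | mdb -kw
   kstat -p zfs:0:arcstats:arc_meta_limit zfs:0:arcstats:arc_meta_used

   # test 3 only: keep metadata (including the DDT) in ARC, but not file data
   zfs set primarycache=metadata tank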
Edward Ned Harvey
2011-Jul-09 17:55 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Given the abysmal performance, I have to assume there is a significant number of "overhead" reads or writes in order to maintain the DDT for each "actual" block write operation. Something I didn't mention in the other email is that I also tracked iostat throughout the whole operation. It's all writes (or at least 99.9% writes). So I am forced to conclude it's a bunch of small DDT maintenance writes taking place and incurring access time penalties in addition to each intended single block access time penalty.

The nature of the DDT is that it's a bunch of small blocks that tend to be scattered randomly and require maintenance in order to do anything else. This sounds like precisely the usage pattern that benefits from low latency devices such as SSDs.

I understand the argument: the DDT must be stored in the primary storage pool so you can increase the size of the storage pool without running out of space to hold the DDT... But it's a fatal design flaw as long as you care about performance... If you don't care about performance, you might as well use the netapp and do offline dedup. The point of online dedup is to gain performance. So in ZFS you have to care about the performance.

There are only two possible ways to fix the problem.

Either ...
The DDT must be changed so it can be stored entirely in a designated sequential area of disk, and maintained entirely in RAM, so all DDT reads/writes can be infrequent and serial in nature... This would solve the case of async writes and large sync writes, but would still perform poorly for small sync writes. And it would be memory intensive. But it should perform very nicely given those limitations. ;-)

Or ...
The DDT stays as it is now, highly scattered small blocks, and there needs to be an option to store it entirely on low latency devices such as dedicated SSDs. Eliminate the need for the DDT to reside on the slow primary storage pool disks. I understand you must consider what happens when the dedicated SSD gets full. The obvious choices would be either (a) dedup turns off whenever the metadatadevice is full, or (b) it defaults to writing blocks in the main storage pool. Maybe that could even be a configurable behavior. Either way, there's a very realistic use case here. For some people in some situations, it may be acceptable to say "I have a 32G mirrored metadatadevice; divided by 137 bytes per entry, I can dedup up to a maximum of 218M unique blocks in the pool, and if I estimate 100K average block size, that means up to 20T primary pool storage. If I reach that limit, I'll add more metadatadevice."

Both of those options would also go a long way toward eliminating the "surprise" delete performance black hole.
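By the way, if you want to see what the DDT actually costs on your own pool, zdb will report the entry count and the in-core / on-disk sizes directly. (Exact output format varies between builds; "tank" is just a placeholder.)

   zdb -D tank     # one-line DDT summary: entries, size on disk, size in core
   zdb -DD tank    # same, plus a histogram of reference counts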
Edward Ned Harvey
2011-Jul-09 18:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> When you read back duplicate data that was previously written with
> dedup, you get a lot more cache hits, and as a result, the reads go
> faster. Unfortunately these gains are diminished... I don't know by
> what... But you only have about 2x to 4x performance gain reading
> previously dedup'd data, as compared to reading the same data which was
> never dedup'd. Even when repeatedly reading the same file which is 100%
> duplicate data (created by dd from /dev/zero), so all the data is 100% in
> cache... I still see only 2x to 4x performance gain with dedup.

For what it's worth: I also repeated this without dedup. Created a large file (17G, just big enough that it will fit entirely in my ARC). Rebooted. Timed reading it. Now it's entirely in cache. Timed reading it again.

When it's not cached, of course the read time was equal to the original write time. When it's cached, it goes 4x faster. Perhaps this is only because I'm testing on a machine that has super fast storage... 11 striped SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded 31.2Gbit/sec. It seems in this case, RAM is only 4x faster than the storage itself... But I would have expected a couple orders of magnitude... So perhaps my expectations are off, or the ARC itself simply incurs overhead. Either way, dedup is not to blame for obtaining merely 2x or 4x performance gain over the non-dedup equivalent.
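The test itself was nothing fancy; roughly this shape (file size, names and pool layout are illustrative):

   # write a file just small enough to fit entirely in ARC, on a non-dedup filesystem
   dd if=/dev/urandom of=/tank/nodedup/bigfile bs=1024k count=17000
   # reboot so the ARC is cold, then:
   ptime dd if=/tank/nodedup/bigfile of=/dev/null bs=1024k   # cold read, from disk
   ptime dd if=/tank/nodedup/bigfile of=/dev/null bs=1024k   # warm read, from ARC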
Roy Sigurd Karlsbakk
2011-Jul-09 18:33 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> When it's not cached, of course the read time was equal to the original
> write time. When it's cached, it goes 4x faster. Perhaps this is only
> because I'm testing on a machine that has super fast storage... 11 striped
> SAS disks yielding 8Gbit/sec, as compared to all-RAM which yielded
> 31.2Gbit/sec. It seems in this case, RAM is only 4x faster than the storage
> itself... But I would have expected a couple orders of magnitude... So
> perhaps my expectations are off, or the ARC itself simply incurs overhead.
> Either way, dedup is not to blame for obtaining merely 2x or 4x performance
> gain over the non-dedup equivalent.

Could you test with some SSD SLOGs and see how well or bad the system performs?

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Edward Ned Harvey
2011-Jul-09 19:30 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 2:33 PM
>
> Could you test with some SSD SLOGs and see how well or bad the system
> performs?

These are all async writes, so slog won't be used. Async writes that have a single fflush() and fsync() at the end to ensure system buffering is not skewing the results.
Roy Sigurd Karlsbakk
2011-Jul-09 19:43 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> > From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> > Sent: Saturday, July 09, 2011 2:33 PM
> >
> > Could you test with some SSD SLOGs and see how well or bad the
> > system performs?
>
> These are all async writes, so slog won't be used. Async writes that
> have a single fflush() and fsync() at the end to ensure system
> buffering is not skewing the results.

Sorry, my bad, I meant L2ARC to help buffer the DDT.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Edward Ned Harvey
2011-Jul-10 13:50 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> --- Performance loss:

I ran one more test that is rather enlightening. I repeated test #2 (tweak arc_meta_limit, use the default primarycache=all), but this time I wrote 100% duplicate data instead of unique. Dedup=sha256 (no verify).

Ideally, you would expect this to write very, very fast... Because it's all duplicate data, and it's all async, the system should just buffer a bunch of tiny metadata changes, aggregate them, and occasionally write a single serial block when it flushes the TXG. It should be much faster to write with dedup.

The results are: with dedup, it writes several times slower. Just the same as test #2, minus the amount of time it takes to write the actual data. For example, here's one datapoint, which is representative of the whole test:

time to write unique data without dedup:  7.090 sec
time to write unique data with dedup:    47.379 sec
time to write duplic data without dedup:  7.016 sec
time to write duplic data with dedup:    39.852 sec

This clearly breaks it down:
7 sec to write the actual data
40 sec overhead caused by dedup
<1 sec is about how fast it should have been writing duplicated data
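For concreteness, the duplicate-data leg of the test is essentially the following (sizes and pool name are illustrative; /dev/zero yields 100% duplicate blocks, while the real harness generates its own unique data for the other cases):

   zfs set dedup=off tank
   ptime dd if=/dev/zero of=/tank/dup-nodedup bs=128k count=80000   # ~10G duplicate data, no dedup
   zfs set dedup=sha256 tank
   ptime dd if=/dev/zero of=/tank/dup-dedup bs=128k count=80000     # ~10G duplicate data, dedup on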
Edward Ned Harvey
2011-Jul-10 14:01 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 3:44 PM
>
> > > Could you test with some SSD SLOGs and see how well or bad the
> > > system performs?
> >
> > These are all async writes, so slog won't be used. Async writes that
> > have a single fflush() and fsync() at the end to ensure system
> > buffering is not skewing the results.
>
> Sorry, my bad, I meant L2ARC to help buffer the DDT.

Oh - it just so happens I don't have one available, but that doesn't mean I can't talk about it. ;-)

For quite a lot of these tests, all the data resides in the ARC, period. The only area where the L2ARC would have an effect is after that region... When I'm pushing the limits of ARC, then there may be some benefit from the use of L2ARC. So...

It is distinctly possible the L2ARC might help soften the "brick wall." When reaching arc_meta_limit, some of the metadata might have been pushed out to L2ARC in order to leave a (slightly) smaller footprint in the ARC... I doubt it, but maybe there could be some gain here.

It is distinctly possible the L2ARC might help test #2 approach the performance of test #3 (test #2 had primarycache=all and suffered approx 10x write performance degradation, while test #3 had primarycache=metadata and suffered approx 6x write performance degradation).

But there's positively no way the L2ARC would come into play on test #3. In this situation, all the metadata, the complete DDT, resides in RAM. So with or without the cache device, the best case we're currently looking at is approx 6x write performance degradation.
Edward Ned Harvey
2011-Jul-10 14:22 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> Sent: Saturday, July 09, 2011 3:44 PM
>
> Sorry, my bad, I meant L2ARC to help buffer the DDT.

Also, bear in mind, the L2ARC is only for reads. So it can't help accelerate writing updates to the DDT. Those updates need to hit the pool, period.

Yes, on test 1 and test 2, there were significant regions where reads were taking place. (Basically the whole test, approx 25% to 30% reads.)

On test 3, there were absolutely no reads up till 75M entries (9.07T used, arc_meta_used = 12960 MB). Up to this point, it was a 4x write performance degradation. Then it suddenly started performing about 5% reads and 95% writes, and suddenly jumped to 6x write performance degradation.
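For reference, the read/write mix and the arc_meta_used numbers I keep quoting come from watching roughly this while the tests run ("tank" again being a placeholder):

   iostat -xn 10                            # per-device read/write mix and service times
   kstat -p zfs:0:arcstats:arc_meta_used zfs:0:arcstats:arc_meta_limit
   zpool list tank                          # how full the pool is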
Jim Klimov
2011-Jul-12 11:40 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-09 20:04, Edward Ned Harvey wrote:
> --- Performance gain:
>
> Unfortunately there was only one area where I found any performance
> gain. When you read back duplicate data that was previously written
> with dedup, you get a lot more cache hits, and as a result, the
> reads go faster. Unfortunately these gains are diminished... I don't
> know by what... But you only have about 2x to 4x performance gain
> reading previously dedup'd data, as compared to reading the same data
> which was never dedup'd. Even when repeatedly reading the same file
> which is 100% duplicate data (created by dd from /dev/zero), so all the
> data is 100% in cache... I still see only 2x to 4x performance gain
> with dedup.

First of all, thanks for all the experimental research and results, even if the outlook is grim. I'd love to see comments about those systems which use dedup and actually gain benefits, and how much they gain (i.e. VM farms, etc.), and what may differ in terms of setup (i.e. at least 256Gb RAM or whatever).

Hopefully the discrepancy between blissful hopes (I had) - that dedup would save disk space and boost the systems somewhat, kinda like online compression can do - and cruel reality would result in some improvement project. Perhaps it would be an offline dedup implementation (perhaps with an online-dedup option turnable off), as recently discussed on the list.

Deleting stuff is still a pain though. For the past week my box has been trying to delete an rsynced backup of a linux machine, some 300k files summing up to 50Gb. Deleting large files was rather quick, but those consuming just a few blocks are really slow. Even if I batch background RMs so a hundred processes hang and then they all at once complete in a minute or two. And quite often the iSCSI initiator or target go crazy, so one of the boxes (or both) have to be rebooted, about thrice a day. I described my setup before, won't clobber it into here ;)

Regarding the low read performance gain, you suggested in a later post that this could be due to the RAM and disk bandwidth difference in your machine. I for one think that (without sufficient ARC block-caching) dedup reading would suffer greatly also from fragmentation - any one large file with some or all deduped data is basically guaranteed to have its blocks scattered across all of your storage. At least if this file was committed to the deduped pool late in its life, when most or all of the blocks were already there.

By the way, did you estimate how much dedup's overhead is in terms of metadata blocks? For example, it was often said on the list that you shouldn't bother with dedup unless your data can be deduped 2x or better, and if you're lucky to already have it on ZFS - you can estimate the reduction with zdb. Now, I wonder where the number comes from - is it empirical, or would dedup metadata take approx 1x the data space, thus under 2x reduction you gain little or nothing? ;)

Thanks for the research,
//Jim
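P.S. The zdb estimate I mean is the simulated dedup run, along the lines of (pool name is just an example):

   zdb -S mypool    # simulates dedup on an existing pool, prints the would-be DDT and dedup ratio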
Edward Ned Harvey
2011-Jul-12 12:46 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Jim Klimov
>
> By the way, did you estimate how much dedup's overhead is
> in terms of metadata blocks? For example, it was often said
> on the list that you shouldn't bother with dedup unless your
> data can be deduped 2x or better, and if you're lucky to
> already have it on ZFS - you can estimate the reduction
> with zdb. Now, I wonder where the number comes from -
> is it empirical, or would dedup metadata take approx 1x
> the data space, thus under 2x reduction you gain little
> or nothing? ;)

You and I seem to have different interpretations of the empirical "2x" soft-requirement to make dedup worthwhile. I always interpreted it like this: if read/write of DUPLICATE blocks with dedup enabled yields 4x performance gain, and read/write of UNIQUE blocks with dedup enabled yields 4x performance loss, then you need a 50/50 mix of unique and duplicate blocks in the system in order to break even. This is the same as having a 2x dedup ratio.

Unfortunately, based on this experience, I would now say something like a dedup ratio of 10x is more likely the break-even point.

Ideally, read/write of unique blocks should be just as fast, with or without dedup. Ideally, read/write of duplicate blocks would be an order of magnitude (or more) faster with dedup. It's not there right now... But I still have high hopes.

You know what? A year ago I would have said dedup still wasn't stable enough for production. Now I would say it's plenty stable enough... But it needs performance enhancement before it's truly useful for most cases.
Jim Klimov
2011-Jul-12 13:05 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
This dedup discussion (and my own bad experience) have also left me with another grim thought: some time ago, sparse-root zone support was ripped out of OpenSolaris. Among the published rationales were the transition to IPS and the assumption that most people used them to save on disk space (the notion about saving RAM on shared objects was somehow dismissed). Regarding the disk savings, it was said that dedup would solve the problem, at least for those systems which use dedup on the zoneroot dataset (and preferably that would be in the rpool, too).

On one hand, storing zoneroots in the rpool was never practical for us because we tend to keep the rpool small and un-clobbered, and on the other hand, now adding dedup to the rpool would seem like shooting oneself in the foot with a salt-loaded shotgun. Maybe it won't kill, but it would hurt a lot and for a long time.

On the third hand ;) with a small rpool hosting zoneroots as well, the DDT would reasonably be small too, and may actually boost performance while saving space. But lots of attention should now be paid to separating /opt, parts of /var and stuff into delegated datasets from a larger data pool. And software like Sun JES, which installs into a full-root zone's /usr, might overwhelm a small rpool as well.

Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool with lots of non-unique data in small blocks?

Thanks,
//Jim
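P.S. Something like the following could probably mock that scenario up; the sizes, paths and properties are just my guess at what would be representative:

   # throwaway 10Gb file-backed pool with dedup and small records, standing in for zoneroots
   mkfile 10g /var/tmp/zonepool.img
   zpool create zonepool /var/tmp/zonepool.img
   zfs create -o recordsize=8k -o dedup=on zonepool/zones
   # then copy in a few near-identical zoneroot-sized trees and watch zpool list / zdb -D zonepool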
Jim Klimov
2011-Jul-12 13:19 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> You and I seem to have different interpretations of the
> empirical "2x" soft-requirement to make dedup worthwhile.

Well, until recently I had little interpretation for it at all, so your approach may be better. I hope that the authors of the requirement statement would step forward and explain what it is about "under the hood" and why 2x ;)

> You know what? A year ago I would have said dedup still
> wasn't stable enough for production. Now I would say it's plenty stable
> enough... But it needs performance enhancement before it's
> truly useful for most cases.

Well, not that this would contradict you, but on my oi_148a (which may be based on code close to a year old), it seems rather unstable, with systems either freezing or slowing down after some writes and having to be rebooted in order to work (fresh-after-boot writes are usually relatively good, i.e. 5Mb/s vs. 100k/s). On the iSCSI server side, the LUN and STMF service often lock up with "device busy" even though the volume "pool/dcpool" is not itself deduped. For me this is only solved by a reboot... And reboots of the VM client, which fights its way through deleting files from the deduped datasets inside "dcpool" (imported over iSCSI), are beyond counting.

Actually, in a couple of weeks I might be passing by that machine and may have a chance to update it to oi_151-dev. Would that buy me any improvements, or potentially worsen my situation? ;)

//Jim
Bob Friesenhahn
2011-Jul-12 13:58 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Tue, 12 Jul 2011, Edward Ned Harvey wrote:
> You know what? A year ago I would have said dedup still wasn't stable
> enough for production. Now I would say it's plenty stable enough... But it
> needs performance enhancement before it's truly useful for most cases.

What has changed for you to change your mind? Did the zfs code change in the past year, or is this based on experience with the same old stagnant code?

Bob
--
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Edward Ned Harvey
2011-Jul-13 01:08 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Bob Friesenhahn [mailto:bfriesen at simple.dallas.tx.us]
> Sent: Tuesday, July 12, 2011 9:58 AM
>
> > You know what? A year ago I would have said dedup still wasn't stable
> > enough for production. Now I would say it's plenty stable enough... But it
> > needs performance enhancement before it's truly useful for most cases.
>
> What has changed for you to change your mind? Did the zfs code change
> in the past year, or is this based on experience with the same old
> stagnant code?

No idea. I assume they've been patching, and I don't hear many people complaining of dedup instability on this list anymore. But the other option is that nothing's changed, and only my perception has changed. I acknowledge that's possible.
Frank Van Damme
2011-Jul-14 07:54 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 12-07-11 13:40, Jim Klimov wrote:
> Even if I batch background RMs so a hundred processes hang
> and then they all at once complete in a minute or two.

Hmmm. I only run one rm process at a time. You think running more processes at the same time would be faster?

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Jim Klimov
2011-Jul-14 10:28 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 11:54, Frank Van Damme wrote:
> Op 12-07-11 13:40, Jim Klimov wrote:
>> Even if I batch background RMs so a hundred processes hang
>> and then they all at once complete in a minute or two.
> Hmmm. I only run one rm process at a time. You think running more
> processes at the same time would be faster?

Yes, quite often it seems so. Whenever my slow "dcpool" decides to accept a write, it processes a hundred pending deletions instead of one ;)

Even so, it took quite a few pool or iSCSI hangs and then reboots of both server and client, and about a week overall, to remove a 50Gb dir with 400k small files from a deduped pool served over iSCSI from a volume in a physical pool. Just completed this night ;)
Frank Van Damme
2011-Jul-14 11:48 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 14-07-11 12:28, Jim Klimov wrote:
> Yes, quite often it seems so. Whenever my slow "dcpool" decides to
> accept a write, it processes a hundred pending deletions instead of one ;)
>
> Even so, it took quite a few pool or iSCSI hangs and then reboots of
> both server and client, and about a week overall, to remove a 50Gb dir
> with 400k small files from a deduped pool served over iSCSI from a
> volume in a physical pool. Just completed this night ;)

It seems counter-intuitive - you'd say concurrent disk access makes things only slower - but it turns out to be true. I'm deleting a dozen times faster than before. How completely ridiculous. Thank you :-)

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
Jim Klimov
2011-Jul-14 15:24 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-14 15:48, Frank Van Damme wrote:
> It seems counter-intuitive - you'd say concurrent disk access makes
> things only slower - but it turns out to be true. I'm deleting a
> dozen times faster than before. How completely ridiculous. Thank you :-)

Well, look at it this way: it is not only about singular disk accesses (i.e. unlike other FSes, you do not modify a directory entry in place); with ZFS COW it is about rewriting a tree of block pointers, with any new writes going into free (unreferenced ATM) disk blocks anyway.

So by hoarding writes you have a chance to reduce the mechanical IOPS required for your tasks. Until you run out of RAM ;)

Just in case it helps, to quickly fire up removals of the specific directory after yet another reboot of the box, and not overwhelm it with hundreds of thousands of queued "rm" processes either, I made this script as /bin/RM:

==
#!/bin/sh

# Optional first argument: seconds to sleep between process-count checks (default 10)
SLEEP=10
[ x"$1" != x ] && SLEEP=$1

A=0
# To rm only small files, add e.g.: find ... -size -10
find /export/OLD/PATH/TO/REMOVE -type f | while read LINE; do
    du -hs "$LINE"
    rm -f "$LINE" &
    A=$(($A+1))
    # After firing 100 background rm's, wait until fewer than ~50 are still running
    [ "$A" -ge 100 ] && ( date
        while [ `ps -ef | grep -wc rm` -gt 50 ]; do
            echo "Sleep $SLEEP..."; ps -ef | grep -wc rm
            sleep $SLEEP; ps -ef | grep -wc rm
        done
        date ) && A="`ps -ef | grep -wc rm`"
done; date
==

Essentially, after firing up 100 "rm attempts" it waits for the "rm" process count to go below 50, then goes on. Sizing may vary between systems, the phase of the moon and the computer's attitude. Sometimes I had 700 processes stacked and processed quickly. Sometimes it hung on 50...

HTH,
//Jim
Daniel Carosone
2011-Jul-15 02:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
um, this is what xargs -P is for ...

--
Dan.

On Thu, Jul 14, 2011 at 07:24:52PM +0400, Jim Klimov wrote:
> Just in case it helps, to quickly fire up removals of the specific directory
> after yet another reboot of the box, and not overwhelm it with hundreds
> of thousands of queued "rm" processes either, I made this script as /bin/RM:
> [...]
> Essentially, after firing up 100 "rm attempts" it waits for the "rm"
> process count to go below 50, then goes on. Sizing may vary
> between systems, the phase of the moon and the computer's attitude.
> Sometimes I had 700 processes stacked and processed quickly.
> Sometimes it hung on 50...
>
> HTH,
> //Jim
Edward Ned Harvey
2011-Jul-15 02:27 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Edward Ned Harvey
>
> I understand the argument: the DDT must be stored in the primary storage pool
> so you can increase the size of the storage pool without running out of space
> to hold the DDT... But it's a fatal design flaw as long as you care about
> performance... If you don't care about performance, you might as well use
> the netapp and do offline dedup. The point of online dedup is to gain
> performance. So in ZFS you have to care about the performance.
>
> There are only two possible ways to fix the problem.
>
> Either ...
> The DDT must be changed so it can be stored entirely in a designated
> sequential area of disk, and maintained entirely in RAM, so all DDT
> reads/writes can be infrequent and serial in nature... This would solve the
> case of async writes and large sync writes, but would still perform poorly
> for small sync writes. And it would be memory intensive. But it should
> perform very nicely given those limitations. ;-)
>
> Or ...
> The DDT stays as it is now, highly scattered small blocks, and there needs
> to be an option to store it entirely on low latency devices such as
> dedicated SSDs. Eliminate the need for the DDT to reside on the slow
> primary storage pool disks. I understand you must consider what happens
> when the dedicated SSD gets full. The obvious choices would be either (a)
> dedup turns off whenever the metadatadevice is full, or (b) it defaults to
> writing blocks in the main storage pool. Maybe that could even be a
> configurable behavior. Either way, there's a very realistic use case here.
> For some people in some situations, it may be acceptable to say "I have a 32G
> mirrored metadatadevice; divided by 137 bytes per entry, I can dedup up to a
> maximum of 218M unique blocks in the pool, and if I estimate 100K average
> block size, that means up to 20T primary pool storage. If I reach that limit,
> I'll add more metadatadevice."
>
> Both of those options would also go a long way toward eliminating the
> "surprise" delete performance black hole.

Is anyone from Oracle reading this? I understand if you can't say what you're working on and stuff like that. But I am merely hopeful this work isn't going into a black hole...

Anyway. Thanks for listening (I hope.) ttyl
Jim Klimov
2011-Jul-15 03:56 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 6:21, Daniel Carosone wrote:
> um, this is what xargs -P is for ...

Thanks for the hint. True, I don't often use xargs.

However, from the man pages, I don't see a "-P" option on OpenSolaris boxes of different releases, and there is only a "-p" (prompt) mode. I am not eager to enter "yes" 400000 times ;)

The way I had this script in practice, I could enter "RM" once and it worked till the box hung. Even then, a watchdog script could often have it rebooted without my interaction, so it could continue in the next lifetime ;)
Daniel Carosone
2011-Jul-15 03:59 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 07:56:25AM +0400, Jim Klimov wrote:
> 2011-07-15 6:21, Daniel Carosone wrote:
> > um, this is what xargs -P is for ...
>
> Thanks for the hint. True, I don't often use xargs.
>
> However, from the man pages, I don't see a "-P" option
> on OpenSolaris boxes of different releases, and there
> is only a "-p" (prompt) mode. I am not eager to enter
> "yes" 400000 times ;)

you want the /usr/gnu/{bin,share/man} version, at least in this case.

--
Dan.
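P.S. Something along these lines, assuming GNU findutils is also installed under /usr/gnu (batch size and parallelism picked out of thin air):

   /usr/gnu/bin/find /export/OLD/PATH/TO/REMOVE -type f -print0 | \
       /usr/gnu/bin/xargs -0 -n 100 -P 50 rm -f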
Frank Van Damme
2011-Jul-15 06:35 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Op 15-07-11 04:27, Edward Ned Harvey wrote:
> Is anyone from Oracle reading this? I understand if you can't say what
> you're working on and stuff like that. But I am merely hopeful this work
> isn't going into a black hole...
>
> Anyway. Thanks for listening (I hope.) ttyl

If they aren't, maybe someone from an open source Solaris version is :)

--
No part of this copyright message may be reproduced, read or seen, dead or alive or by any means, including but not limited to telepathy without the benevolence of the author.
phil.harman@gmail.com
2011-Jul-15 07:10 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
If you clone zones from a golden image using ZFS cloning, you get fast, efficient dedup for free. Sparse root always was a horrible hack!

----- Reply message -----
From: "Jim Klimov" <jimklimov at cos.ru>
To:
Cc: <zfs-discuss at opensolaris.org>
Subject: [zfs-discuss] Summary: Dedup memory and performance (again, again)
Date: Tue, Jul 12, 2011 14:05

> This dedup discussion (and my own bad experience) have also left me with
> another grim thought: some time ago, sparse-root zone support was ripped
> out of OpenSolaris. Among the published rationales were the transition to
> IPS and the assumption that most people used them to save on disk space
> (the notion about saving RAM on shared objects was somehow dismissed).
> Regarding the disk savings, it was said that dedup would solve the problem,
> at least for those systems which use dedup on the zoneroot dataset (and
> preferably that would be in the rpool, too).
> [...]
> Anyhow, Edward, is there a test for this scenario - i.e. a 10Gb pool
> with lots of non-unique data in small blocks?
Jim Klimov
2011-Jul-15 10:19 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
2011-07-15 11:10, phil.harman at gmail.com wrote:
> If you clone zones from a golden image using ZFS cloning, you get
> fast, efficient dedup for free. Sparse root always was a horrible hack!

Sounds like a holy war is flaming up ;)

From what I heard, sparse-root zones with shared common system libraries allowed saving not only on disk space but also on RAM. Can't vouch, never tested extensively myself.

Cloning of golden zones is of course used in our systems. But this approach breaks badly upon any major system update (i.e. LiveUpgrade to a new release) - many of the binaries change, and you either suddenly have the zones (wanting to) consume many gigabytes of disk space which are not there on a small rpool or a busy data pool, or you have to make a new golden image, clone a new set of zones and reinstall/migrate all applications and settings.

True, this is a no-brainer for zones running a single task, like an /opt/tomcat directory which can be tossed around to any OS, but it becomes tedious for software with many packages and complicated settings, especially if (in some extremity) it was homebrewed and/or home-compiled and unpackaged ;)

I am not the first (or probably last) to write about the inconvenience of zone upgrades, which loses the cloning benefit, and much of the same is true for upgrading cloned/deduped VM golden images as well, where the golden image is just some common baseline OS but the clones all run different software. And it is this different software which makes them useful and unique, and too distinct to maintain a dozen golden images efficiently (i.e. there might be just 2 or 3 clones of each gold).

But in general, the problem is there - you either accept that your OS images in effect won't be deduped, much or at all, after some lifespan involving OS upgrades, or you don't update them often (which may be unacceptable for security and/or paranoia types of deployments), or you use some trickery to update frequently and not lose much disk space, such as automation of software and config migration from one clone (of old gold) to another clone (of new gold).

Dedup was a promising variant in this context, unless it kills performance and/or stability... which was the subject of this thread, with Edward's research into the performance of the current dedup implementation (and perhaps some baseline to test whether real improvements appear in the future).

And in terms of performance there's some surprise in Edward's findings regarding i.e. reads from the deduped data. For infrequent maintenance (i.e. monthly upgrades) zoneroots (the OS image part) would be read-mostly, and write performance of dedup may not matter much. If the updates must pour in often for whatever reason, then write and delete performance of dedup may begin to matter.

Sorry about straying the discussion into zones - they, their performance, and coping with changes introduced during their lifetime (see OS upgrades) are one good example for discussion of dedup, and one application of it which may be commonly useful on any server or workstation, not only on hardware built for dedicated storage.

Sparse-root vs. full-root zones, or disk images of VMs; whether they are stuffed in one rpool or spread between rpool and data pools - that detail is not actually the point of the thread. Actual usability of dedup for savings and gains on these tasks (preferably working also on low-to-mid-range boxes, where adding a good enterprise SSD would double the server cost - not only on those big good systems with tens of GB of RAM), and hopefully simplifying system configuration and maintenance - that is indeed the point in question.

//Jim
Mike Gerdts
2011-Jul-15 14:55 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Fri, Jul 15, 2011 at 5:19 AM, Jim Klimov <jimklimov at cos.ru> wrote:
> 2011-07-15 11:10, phil.harman at gmail.com wrote:
>> If you clone zones from a golden image using ZFS cloning, you get fast,
>> efficient dedup for free. Sparse root always was a horrible hack!
>
> Sounds like a holy war is flaming up ;)
>
> From what I heard, sparse-root zones with shared common
> system libraries allowed saving not only on disk space but
> also on RAM. Can't vouch, never tested extensively myself.

There may be some benefit to that; I'd argue that most of the time there's not that much. Using what is surely an imperfect way of measuring, I took a look at a zone on a Solaris 10 box that I happen to be logged into. I found it is using about 52 MB of memory in mappings of executables and libraries. By disabling webconsole (a java program that has an RSS of 100+ MB), the shared mappings drop to 40 MB.

# cd /proc
# pmap -xa * | grep r.x | grep -v ' anon ' | grep -v ' stack ' | grep -v ' heap ' | sort -u | nawk '{ t+= $3 } END { print t / 1024, "MB" }'
pmap: cannot examine 22427: system process
40.3281 MB

If you are running the same large application (large executable + libraries resident in memory) in many zones, you may have additional benefit.

Solaris 10 was released in 2005, meaning that sparse root zones were conceived sometime in the years leading up to that. In that time, entry level servers have gone from 1 - 2 GB of memory (e.g. a V210 or V240) to 12 - 16+ GB of memory (X2270 M2, T3-1). Further, large systems tend to have NUMA characteristics that challenge the logic of trying to maintain only one copy of hot read-only executable pages. It just doesn't make sense to constrain the design of zones around something that is going to save 0.3% of the memory of an entry level server. Even in 2005, I'm not so sure it was a strong argument.

Disk space is another issue. Jim does a fine job of describing the issues around that.

> Cloning of golden zones is of course used in our systems.
> But this approach breaks badly upon any major system
> update (i.e. LiveUpgrade to a new release) [...]

--
Mike Gerdts
http://mgerdts.blogspot.com/
Ian Collins
2011-Jul-23 08:01 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/10/11 04:04 AM, Edward Ned Harvey wrote:
> There were a lot of useful details put into the thread "Summary: Dedup
> and L2ARC memory requirements".
>
> Please refer to that thread as necessary... After much discussion
> leading up to that thread, I thought I had enough understanding to
> make dedup useful, but then in practice, it didn't work out. Now I've
> done a lot more work on it, reduced it all to practice, and I finally
> feel I can draw up conclusions that are actually useful:
>
> I am testing on a Sun Oracle server, X4270, 1 Xeon 4-core 2.4GHz, 24G
> RAM, 12 disks ea 2T SAS 7.2krpm. The Solaris 11 Express snv_151a

Can you provide more details of your tests?

I'm currently testing a couple of slightly better configured X4270s (2 CPUs, 96GB RAM and a FLASH accelerator card) using real data from an existing server. So far, I haven't seen the levels of performance fall-off you report. I currently have about 5TB of uncompressed data in the pool (a stripe of 5 mirrors) and throughput is similar to the existing Solaris 10 servers. The pool dedup ratio is 1.7, so there's a good mix of unique and duplicate blocks.

--
Ian.
Edward Ned Harvey
2011-Jul-23 13:17 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: Ian Collins [mailto:ian at ianshome.com]
> Sent: Saturday, July 23, 2011 4:02 AM
>
> Can you provide more details of your tests?

Here's everything: http://dl.dropbox.com/u/543241/dedup%20tests/dedup%20tests.zip
In particular, look under the "work server" directory.

The basic concept goes like this: find some amount of data that takes approx 10 sec to write. I don't know the size; I just kept increasing a block counter till I got times I felt were reasonable, so let's suppose it's 10G.

Time writing that much without dedup (all unique). Remove the file.
Time writing that much with dedup (sha256, no verify) (all unique). Remove the file.
Write 10x that much with dedup (all unique). Don't remove the file.
Repeat.

So I'm getting comparisons of write speeds for 10G files, sampling at 100G intervals. For a 6x performance degradation, it would be 7 sec to write without dedup, and 40-45 sec to write with dedup. I am doing fflush() and fsync() at the end of every file write, to ensure results are not skewed by write buffering.
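Stripped of the bookkeeping, each sample in those scripts is shaped roughly like this. (Pool name, sizes and iteration count are illustrative; the real harness writes its own unique data and does the fflush()/fsync() itself rather than using dd, so /dev/urandom here is only a stand-in for a unique-data generator.)

   #!/bin/sh
   # One sampling interval: time a ~10G write without dedup, then with dedup,
   # then leave ~100G of unique deduped data behind so the next sample sees a bigger DDT.
   POOL=tank
   i=1
   while [ $i -le 150 ]; do
       zfs set dedup=off $POOL
       ptime dd if=/dev/urandom of=/$POOL/sample.nodedup bs=128k count=80000
       rm /$POOL/sample.nodedup

       zfs set dedup=sha256 $POOL
       ptime dd if=/dev/urandom of=/$POOL/sample.dedup bs=128k count=80000
       rm /$POOL/sample.dedup

       # filler is written with dedup still enabled, and never removed
       dd if=/dev/urandom of=/$POOL/filler.$i bs=128k count=800000
       i=$(($i + 1))
   done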
Roberto Waltman
2011-Jul-24 16:21 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
Edward Ned Harvey wrote:
> So I'm getting comparisons of write speeds for 10G files, sampling at 100G
> intervals. For a 6x performance degradation, it would be 7 sec to write
> without dedup, and 40-45 sec to write with dedup.

For a totally unscientific data point:

The HW: Server - Supermicro "server" motherboard, Intel 920 CPU, 6 GB memory, 1 x 16 GB SSD as a boot device, 8 x 2TB "green" (5400 RPM?) hard drives. The disks are configured with 3 equal size partitions: all p1's in one raidz2 pool, all p2's in another, all p3's in another. (Done to improve performance by limiting head movement when most of the disk activity is in one pool.)

The SW: the last release of OpenSolaris. (Current at the time; I have since moved to Solaris 11.)

The test: back up an almost full 750Gb external hard disk formatted as a single NTFS volume. The disk was connected via eSATA to a fast computer (also a Supermicro + i920) running Ubuntu. The Ubuntu machine had access to the file server via NFS. The NFS-exported file system was created new for this backup, with dedup enabled, encryption and compression disabled, atime=off. This was the first (and last) time I tried enabling dedup.

From previous similar transfers (without dedup), I expected the backup to be finished in a few hours overnight, with the bottlenecks being the NTFS-3G driver in Ubuntu and the 100Mbit ethernet connection.

It took more than EIGHT DAYS, without any other activity going on on either machine.

(My) conclusion: Running low on storage? Get more/bigger disks.

--
Roberto Waltman
Nico Williams
2011-Jul-24 18:42 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On Jul 9, 2011 1:56 PM, "Edward Ned Harvey" <opensolarisisdeadlongliveopensolaris at nedharvey.com> wrote:
>
> Given the abysmal performance, I have to assume there is a significant
> number of "overhead" reads or writes in order to maintain the DDT for each
> "actual" block write operation. Something I didn't mention in the other
> email is that I also tracked iostat throughout the whole operation. It's
> all writes (or at least 99.9% writes). So I am forced to conclude it's a
> bunch of small DDT maintenance writes taking place and incurring access time
> penalties in addition to each intended single block access time penalty.
>
> The nature of the DDT is that it's a bunch of small blocks that tend to be
> scattered randomly and require maintenance in order to do anything else.
> This sounds like precisely the usage pattern that benefits from low latency
> devices such as SSDs.

The DDT should be written to in COW fashion, and asynchronously, so there should be no access time penalty. Or so ISTM it should be.

Dedup is necessarily slower for writing because of the deduplication table lookups. Those are synchronous lookups, but for async writes you'd think that total write throughput would only be affected by a) the additional read load (which is zero in your case) and b) any inability to put together large transactions due to the high latency of each logical write, but (b) shouldn't happen, particularly if the DDT fits in RAM or L2ARC, as it does in your case.

So, at first glance, my guess is that ZFS is leaving dedup write performance on the table, most likely due to implementation reasons, not design reasons.

Nico
--
Ian Collins
2011-Jul-25 00:38 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
On 07/25/11 04:21 AM, Roberto Waltman wrote:
> Edward Ned Harvey wrote:
> > So I'm getting comparisons of write speeds for 10G files, sampling at 100G
> > intervals. For a 6x performance degradation, it would be 7 sec to write
> > without dedup, and 40-45 sec to write with dedup.
>
> For a totally unscientific data point:
> [...]
> It took more than EIGHT DAYS, without any other activity going on on either
> machine.
>
> (My) conclusion: Running low on storage? Get more/bigger disks.

Add to that: if running dedup, get plenty of RAM and cache.

I'm still seeing similar performance on my test system with and without dedup enabled. Snapshot deletion appears slightly slower, but I have yet to run timed tests.

--
Ian.
Edward Ned Harvey
2011-Jul-25 13:31 UTC
[zfs-discuss] Summary: Dedup memory and performance (again, again)
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-bounces at opensolaris.org] On Behalf Of Ian Collins
>
> Add to that: if running dedup, get plenty of RAM and cache.

Add plenty of RAM. And tweak your arc_meta_limit. You can at least get dedup performance that's on the same order of magnitude as performance without dedup.

Cache devices don't really help dedup very much, because each DDT entry stored in ARC/L2ARC takes 376 bytes, and each reference to an L2ARC entry requires 176 bytes of ARC. So in order to prevent an individual DDT entry from being evicted to disk, you must either keep the 376 bytes in ARC, or evict it to L2ARC and keep 176 bytes. This is a very small payload. A good payload would be to evict a 128k block from ARC into L2ARC, keeping only the 176 bytes in ARC.
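To put numbers on that, here's the back-of-envelope math for a pool about the size of my test pool, using the per-entry sizes above and assuming 128K blocks:

   BLOCKS=$((14 * 1024 * 1024 * 1024 * 1024 / 131072))   # ~14T of 128K blocks ~= 117M entries
   echo "DDT held entirely in ARC:          $((BLOCKS * 376 / 1048576)) MB"
   echo "ARC headers if DDT lives in L2ARC: $((BLOCKS * 176 / 1048576)) MB"

That works out to roughly 42000 MB versus roughly 19700 MB; even with the whole DDT evicted to a cache device, you still need on the order of 20G of ARC just to point at it.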