I haven't seen much discussion on how deduplication affects performance.
I've enabled dedup on my 4-disk raidz array and have seen a significant
drop in write throughput, from about 100 MB/s to 3 MB/s. I can't
imagine such a decrease is normal.

> # zpool iostat nest 1 (with dedup enabled):
> ...
> nest        1.05T   411G     91     18    197K  2.35M
> nest        1.05T   411G    147     15    443K  1.98M
> nest        1.05T   411G     82     28    174K  3.59M

> # zpool iostat nest 1 (with dedup disabled):
> ...
> nest        1.05T   410G      0    787       0  96.9M
> nest        1.05T   410G      1    899    253K  95.0M
> nest        1.05T   409G      0    533       0  48.5M

I do notice when dedup is enabled that the drives sound like they are
constantly seeking. iostat shows average service times around 20 ms,
which is normal for my drives, and prstat shows that my processor and
memory aren't a bottleneck. What could cause such a marked decrease in
throughput? Is anyone else experiencing similar effects?

Thanks,

James
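A minimal sketch of that on/off comparison, assuming a pool named nest
as above (dedup is a per-dataset property and only affects blocks
written after the change):

    # show whether dedup is currently enabled
    zfs get dedup nest

    # toggle it, rewrite the test data, and watch throughput while writing
    zfs set dedup=on nest
    zpool iostat nest 1      # pool-level throughput, as in the output above
    iostat -xn 1             # per-device service times (asvc_t column)

    zfs set dedup=off nest   # only blocks written after this are affected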
On Fri, Jan 08, 2010 at 10:00:14AM -0800, James Lee wrote:
> I haven't seen much discussion on how deduplication affects performance.
> I've enabled dedup on my 4-disk raidz array and have seen a significant
> drop in write throughput, from about 100 MB/s to 3 MB/s. I can't
> imagine such a decrease is normal.

Seems like I've seen other posts with similar numbers (maybe 9 MB/s or
so?). Sounded like adding an SSD for caching really improved
performance, however.

> > # zpool iostat nest 1 (with dedup enabled):
> > ...
> > nest        1.05T   411G     91     18    197K  2.35M
> > nest        1.05T   411G    147     15    443K  1.98M
> > nest        1.05T   411G     82     28    174K  3.59M
>
> > # zpool iostat nest 1 (with dedup disabled):
> > ...
> > nest        1.05T   410G      0    787       0  96.9M
> > nest        1.05T   410G      1    899    253K  95.0M
> > nest        1.05T   409G      0    533       0  48.5M
>
> I do notice when dedup is enabled that the drives sound like they are
> constantly seeking. iostat shows average service times around 20 ms,
> which is normal for my drives, and prstat shows that my processor and
> memory aren't a bottleneck. What could cause such a marked decrease in
> throughput? Is anyone else experiencing similar effects?
>
> Thanks,
>
> James

Ray
See the reads on the pool with the low I/O? I suspect reading the DDT
causes the writes to slow down.

See this bug:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566
It seems to give some background.

Can you test setting "primarycache=metadata" on the volume you test?
This would be my initial test. My suggestion is that it may improve the
situation because your ARC can then be better utilized for the DDT.
(This does not make much sense for production without an SSD cache,
because you practically disable all read caching of data if you have no
L2ARC (aka SSD)!)

As I read the bug report above, it seems that if the DDT (deduplication
table) does not fit into memory, or is dropped from there, the DDT has
to be read from disk, causing massive random I/O.
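A minimal sketch of that test, assuming the dataset under test is the
pool's root dataset nest (primarycache and secondarycache are standard
ZFS properties; any dataset name works):

    zfs get primarycache,secondarycache nest

    # cache only metadata (which includes the DDT) in the ARC for this
    # dataset, then rerun the write test and compare zpool iostat numbers
    zfs set primarycache=metadata nest

    # restore the default afterwards, otherwise normal data reads stay uncached
    zfs set primarycache=all nest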
James Lee wrote:
> I haven't seen much discussion on how deduplication affects performance.
> I've enabled dedup on my 4-disk raidz array and have seen a significant
> drop in write throughput, from about 100 MB/s to 3 MB/s. I can't
> imagine such a decrease is normal.

What is your data? I've found that data which lends itself to
deduplication writes slightly faster, while data that does not (video,
ISO images) writes dramatically slower. So I turn dedup (and
compression) off for filesystems containing "random" data.

--
Ian.
On Fri, Jan 8, 2010 at 1:44 PM, Ian Collins <ian at ianshome.com> wrote:
> James Lee wrote:
>> I haven't seen much discussion on how deduplication affects performance.
>> I've enabled dedup on my 4-disk raidz array and have seen a significant
>> drop in write throughput, from about 100 MB/s to 3 MB/s. I can't
>> imagine such a decrease is normal.
>
> What is your data?

I have seen the same: fsstat reports 4-7 seconds of small writes, then
bursts of 40-80 MB/s, but without dedup I see 80-150 MB/s writes on my
4x 500 GB SATA drives, split between two controllers, with 6 GB of RAM
and about 1.5 TB of storage with 1.2 TB used. If I disable dedup, speed
goes back up.

While doing dedup writes, zfs destroy pool/filesystem takes about 100x
as long as usual, even if the filesystem being destroyed is empty;
reports say it is far worse when over 100 GB of data is on it. My dedup
ratio for the pool is 1.15x. Read performance seems about the same or
slightly faster; I didn't really benchmark that workload since my
clients seem to be the bottleneck.

As money is tight at the moment I don't have the funds for an SSD to
test with, but I do have space on an under-utilized disk to try. I
haven't researched the effect of adding and removing (if possible)
L2ARC or ZIL log slices on a pool. It would be great to enable a
5-50 GB slice off a SATA drive as a log device for greater performance.

James Dickens
uadmin.blogspot.com

> I've found that data which lends itself to deduplication writes
> slightly faster, while data that does not (video, ISO images) writes
> dramatically slower. So I turn dedup (and compression) off for
> filesystems containing "random" data.
>
> --
> Ian.
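A minimal sketch of trying a spare slice as a cache or log device; the
device names below are placeholders, and log-device removal only works
on a sufficiently recent pool version:

    # add a spare slice as an L2ARC (read cache) device
    zpool add nest cache c1t2d0s3

    # add another slice as a separate ZIL (log) device
    zpool add nest log c1t2d0s4

    # cache devices can be removed again at any time; removing a log
    # device needs a recent enough pool version (see zpool upgrade -v)
    zpool remove nest c1t2d0s3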
On 01/08/2010 02:42 PM, Lutz Schumann wrote:
> See the reads on the pool with the low I/O? I suspect reading the
> DDT causes the writes to slow down.
>
> See this bug:
> http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6913566
> It seems to give some background.
>
> Can you test setting "primarycache=metadata" on the volume you test?
> This would be my initial test. My suggestion is that it may improve
> the situation because your ARC can then be better utilized for the
> DDT. (This does not make much sense for production without an SSD
> cache, because you practically disable all read caching of data if
> you have no L2ARC (aka SSD)!)
>
> As I read the bug report above, it seems that if the DDT
> (deduplication table) does not fit into memory, or is dropped from
> there, the DDT has to be read from disk, causing massive random I/O.

The symptoms described in that bug report do match up with mine. I have
also experienced long hang times (>1 hr) destroying a dataset while the
disks just thrash. I tried setting "primarycache=metadata", but that
did not help.

I pulled the DDT statistics for my pool, but don't know how to
determine its physical size-on-disk from them. If deduplication ends up
requiring what amounts to a separate log device, that will be a real
shame.

> # zdb -DD nest
> DDT-sha256-zap-duplicate: 780321 entries, size 338 on disk, 174 in core
> DDT-sha256-zap-unique: 6188123 entries, size 335 on disk, 164 in core
>
> DDT histogram (aggregated over all DDTs):
>
> bucket              allocated                       referenced
> ______   ______________________________   ______________________________
> refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
> ------   ------   -----   -----   -----   ------   -----   -----   -----
>      1    5.90M    752G    729G    729G    5.90M    752G    729G    729G
>      2     756K   94.0G   93.7G   93.6G    1.48M    188G    187G    187G
>      4    5.36K    152M   80.3M   81.5M    22.4K    618M    325M    330M
>      8      258   4.05M   1.93M   2.00M    2.43K   36.7M   16.3M   16.9M
>     16       30    434K     42K   50.9K      597   10.2M    824K   1003K
>     32        5    255K   65.5K   66.6K      204   10.5M   3.26M   3.30M
>     64       20   2.02M    906K    910K    1.41K    141M   62.0M   62.2M
>    128        4      2K      2K   2.99K      723    362K    362K    541K
>    256        1     512     512     766      277    138K    138K    207K
>    512        2      1K      1K   1.50K    1.62K    830K    830K   1.21M
>  Total    6.65M    846G    823G    823G    7.41M    941G    917G    917G
>
> dedup = 1.11, compress = 1.03, copies = 1.00, dedup * compress / copies = 1.14
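As a rough back-of-the-envelope estimate from the zdb -DD summary lines
above, reading the "size ... on disk, ... in core" figures as average
bytes per entry (an assumption, but the usual interpretation):

    # 780321 duplicate entries at ~174 bytes each in core / 338 on disk, plus
    # 6188123 unique entries at ~164 bytes each in core / 335 on disk
    echo $(( 780321 * 174 + 6188123 * 164 ))   # ~1.15e9 bytes (~1.1 GB) in RAM
    echo $(( 780321 * 338 + 6188123 * 335 ))   # ~2.3e9 bytes (~2.3 GB) on disk

If the ARC cannot keep that roughly 1 GB of DDT resident alongside
everything else, each deduped write turns into random DDT reads from
disk, which matches the behaviour described in the bug report above.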
We're having to split data across multiple pools if we enable dedup,
1+ TB each (one 6x 750 GB pool is particularly bad). The timeouts cause
COMSTAR / iSCSI to fail; Windows clients are dropping the persistent
targets due to timeouts (> 15 seconds, it seems). This is causing
bigger problems. Disabling dedup is an option, but I wouldn't think it
should be *THAT* much load.

Keeping the DDT on a cache drive is reasonable; however, if this is
required, OpenSolaris should add something like a DDTCacheDevice so we
can dedicate a device to it, separate from the secondarycache. I'll
drop in a 150 GB cache drive tonight to see if it improves things.

Steve Radich
www.BitShop.com
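There is no dedicated DDT device today; the closest approximation with
existing knobs is to add the drive as L2ARC and restrict the L2ARC to
metadata, which the DDT is part of. A sketch, using a placeholder
device name and assuming the pool name tank1 from the follow-up:

    # add the new drive as an L2ARC device
    zpool add tank1 cache c7t1d0

    # keep only metadata (DDT included) on the L2ARC, so bulk file data
    # does not compete with it for the 150 GB
    zfs set secondarycache=metadata tank1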
I should note that running "zfs set primarycache=metadata tank1" took a
few minutes. It seems changing what is cached in RAM should be instant
(we don't need to flush the data out of RAM, just stop putting it back
in). Disk I/O seemed slow during this, though that could have been
unrelated.
http://www.bitshop.com/Blogs/tabid/95/EntryId/78/Bug-in-OpenSolaris-SMB-Server-causes-slow-disk-i-o-always.aspx

This explains just how major a bug this issue is, IMHO. Judging from
the symptoms, I now think the SMB slowdown from Windows 2003 is doing
something odd in the kernel; see the rsync performance tests in that
post. Our file move used to render the server almost unusable (in fact
some SAN clients would report the iSCSI host had disappeared and shut
down). Now, during the same copy / load on the disks, the iSCSI clients
are insanely fast; the only difference is that server/smb is disabled.
I think ZFS dedup just made it appear worse.
>>>>> "srbi" == Steve Radich, BitShop, Inc <stever at bitshop.com> writes:srbi> http://www.bitshop.com/Blogs/tabid/95/EntryId/78/Bug-in-OpenSolaris-SMB-Server-causes-slow-disk-i-o-always.aspx I''m having trouble understanding many things in here like ``our file move'''' (moving what from where to where with what protocol?) and ``with SMB running'''' (with the server enabled on Solaris, with filesystems mounted, with activity on the mountpoints? what does running mean?) and ``RAID-0/stripe reads is the slow point'''' (what does this mean? How did you determine which part of the stack is limiting the observed speed? This is normally quite difficult and requires comparing several experiments, not doing just one experiment like ``a file move between zfs pools''''.). What is ``bytes the negotiated protocol allows''''? mtu, mss, window size? Can you show us in what tool you see one number and where you see the other number that''s too big? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20100324/b4f60a1e/attachment.bin>