Hi all,

I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.

Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.

I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry is for a slab, which can vary in size (512 bytes to 128k). So there's no good generic formula for X bytes of RAM per Y TB of space. You can compute a rough guess if you know what kind of data and general usage pattern the pool will see (basically, you need to take a stab at how big you think the average slab size is). Also, remember that if you have a /very/ good dedup ratio, you will have a smaller DDT for a given pool size than a pool with a poor dedup ratio. Unfortunately, there's no magic bullet, though if you can dig up Richard's post, you should be able to take a guess and not be off by more than 2x or so.

Also, remember you only need to hold the DDT in L2ARC, not in actual RAM, so buy that SSD, young man!

As far as failures go, I can't speak to that specifically. Do realize, though, that not having sufficient L2ARC/RAM to hold the DDT means you spend an awful amount of time reading pool metadata, which really hurts performance (not to mention that it can cripple deletes of any sort...).

> Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?

Honestly, I don't see it being much different over the last couple of builds. The limitations are still there, but given those, I've found it works well.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
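[For anyone following the "buy that SSD" advice: attaching an SSD as an L2ARC cache device is a single zpool command. A minimal sketch, assuming a pool named tank and a spare SSD at c2t0d0 (both names are placeholders, not from the posts above):]

    # add the SSD as a cache (L2ARC) device to the pool
    zpool add tank cache c2t0d0

    # verify that the cache vdev shows up under the pool
    zpool status tank

[The cache device can be removed again with "zpool remove tank c2t0d0" if needed; losing an L2ARC device is not fatal to the pool.]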
Hi,

It's getting better; I believe it's no longer single-threaded after build 135
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6922161), but we're still waiting for a major bug fix:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924824
It should be fixed before the release, AFAIK.

Yours
Markus Kovero
----- "Erik Trimble" <erik.trimble at oracle.com> wrote:
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry

150 bytes per block, IIRC, but still, it'd be nice to have this in the official ZFS docs. Let's hope it is added soon.

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Erik Trimble wrote:
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry is for a slab, which can vary in size (512 bytes to 128k). So there's no good generic formula for X bytes of RAM per Y TB of space. You can compute a rough guess if you know what kind of data and general usage pattern the pool will see (basically, you need to take a stab at how big you think the average slab size is). Also, remember that if you have a /very/ good dedup ratio, you will have a smaller DDT for a given pool size than a pool with a poor dedup ratio. Unfortunately, there's no magic bullet, though if you can dig up Richard's post, you should be able to take a guess and not be off by more than 2x or so.
>
> Also, remember you only need to hold the DDT in L2ARC, not in actual RAM, so buy that SSD, young man!
>
> As far as failures go, I can't speak to that specifically. Do realize, though, that not having sufficient L2ARC/RAM to hold the DDT means you spend an awful amount of time reading pool metadata, which really hurts performance (not to mention that it can cripple deletes of any sort...).

Here's Richard Elling's post in the "dedup and memory/l2arc requirements" thread, where he presents a worst-case upper bound on DDT size:
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039516.html

------start of copy------

You can estimate the amount of disk space needed for the deduplication table and the expected deduplication ratio by using "zdb -S poolname" on your existing pool. Be patient; for an existing pool with lots of objects, this can take some time to run.

# ptime zdb -S zwimming
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.27M    239G    188G    194G    2.27M    239G    188G    194G
     2     327K   34.3G   27.8G   28.1G     698K   73.3G   59.2G   59.9G
     4    30.1K   2.91G   2.10G   2.11G     152K   14.9G   10.6G   10.6G
     8    7.73K    691M    529M    529M    74.5K   6.25G   4.79G   4.80G
    16      673   43.7M   25.8M   25.9M    13.1K    822M    492M    494M
    32      197   12.3M   7.02M   7.03M    7.66K    480M    269M    270M
    64       47   1.27M    626K    626K    3.86K    103M   51.2M   51.2M
   128       22    908K    250K    251K    3.71K    150M   40.3M   40.3M
   256        7    302K     48K   53.7K    2.27K   88.6M   17.3M   19.5M
   512        4    131K   7.50K   7.75K    2.74K    102M   5.62M   5.79M
    2K        1       2K      2K      2K   3.23K   6.47M   6.47M   6.47M
    8K        1     128K      5K      5K   13.9K   1.74G   69.5M   69.5M
 Total    2.63M    277G    218G    225G    3.22M    337G    263G    270G

dedup = 1.20, compress = 1.28, copies = 1.03, dedup * compress / copies = 1.50

real     8:02.391932786
user     1:24.231855093
sys        15.193256108

In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:

in-core size = 2.63M * 250 = 657.5 MB

If your dedup ratio is 1.0, then this number will scale linearly with size.
If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
-- richard

------end of copy------
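[To turn that "Total" line into the worst-case in-core figure without doing the arithmetic by hand, something along these lines should work. This is only a sketch: it assumes the Total line format shown above, with block counts scaled by a binary K/M/G suffix, and reuses the example pool name zwimming.]

    zdb -S zwimming | awk '
      $1 == "Total" {
        n = $2
        # scale the block count by its suffix (zdb prints 1024-based units)
        if      (n ~ /K$/) { sub(/K$/, "", n); n *= 1024 }
        else if (n ~ /M$/) { sub(/M$/, "", n); n *= 1024 * 1024 }
        else if (n ~ /G$/) { sub(/G$/, "", n); n *= 1024 * 1024 * 1024 }
        printf "allocated blocks:        %.0f\n", n
        # ~250 bytes of in-core DDT per allocated block, worst case
        printf "worst-case in-core DDT:  %.1f MB\n", n * 250 / (1024 * 1024)
      }'

[For the histogram above this prints roughly 2.76 million blocks and the same 657.5 MB that Richard computed by hand.]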
----- "Haudy Kazemi" <kaze0010 at umn.edu> wrote:
> In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:
>
> in-core size = 2.63M * 250 = 657.5 MB
>
> If your dedup ratio is 1.0, then this number will scale linearly with size.
> If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.

How large was this filesystem?

Are there any good ways of planning memory or SSDs for this?

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Roy Sigurd Karlsbakk wrote:
> ----- "Haudy Kazemi" <kaze0010 at umn.edu> wrote:
>> In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:
>>
>> in-core size = 2.63M * 250 = 657.5 MB
>>
>> If your dedup ratio is 1.0, then this number will scale linearly with size.
>> If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
>
> How large was this filesystem?
>
> Are there any good ways of planning memory or SSDs for this?
>
> roy

If you mean figuring out how big memory should be BEFORE you write any data, you need to guesstimate the average block size of the files you will store in the zpool, which is highly data-dependent. In general, ZFS writes a file of size X using a block size Y, where Y is a power of 2 and the smallest size such that X <= Y, up to a maximum of Y = 128k. So look at your (potential) data and consider how big the files are.

The DDT requirement for RAM/L2ARC is:

250 bytes * number of blocks

So, let's say I'm considering a 1TB pool where I think I'm going to store 200GB of MP3s, 200GB of source code, 200GB of miscellaneous Office docs, and 200GB of JPEG image files from my 8-megapixel camera (you don't want the pool more than 80% full!).

Assumed block sizes, and thus number of blocks, for each:

Data          Block size   # blocks per 200GB
MP3           128k         ~1.6 million
Source code   1k           ~200 million
Office docs   32k          ~6.5 million
Pictures      4k           ~52 million

Thus, the total number of blocks you'll need is ~260 million, and the DDT size is:

260 million * 250 bytes = 65GB

Note that the source code takes up 20% of the space but requires 80% of the DDT entries.

Given that the above is the worst case for that file mix (actual dedup/compression will lower the total block count), I would use it for the maximum L2ARC size you want. RAM sizing depends on the size of your *active* working set of files; I'd want enough RAM to cache both all my writes and my most commonly read files at once.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
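[The worksheet above is easy to redo for a different data mix. A rough sketch of the same arithmetic, using the hypothetical sizes and assumed block sizes from Erik's example (the exact output differs slightly from his 65GB figure because his block counts are rounded):]

    awk 'BEGIN {
      gb = 1024 * 1024 * 1024

      # assumed average block size per data class (hypothetical mix from above)
      blocks["MP3 (128k)"]        = (200 * gb) / (128 * 1024)
      blocks["source code (1k)"]  = (200 * gb) / (1 * 1024)
      blocks["office docs (32k)"] = (200 * gb) / (32 * 1024)
      blocks["pictures (4k)"]     = (200 * gb) / (4 * 1024)

      total = 0
      for (c in blocks) {
        printf "%-20s %14.0f blocks\n", c, blocks[c]
        total += blocks[c]
      }
      printf "%-20s %14.0f blocks\n", "total", total

      # ~250 bytes of in-core DDT per unique block, worst case (no dedup hits)
      printf "worst-case DDT size: %.1f GiB\n", total * 250 / gb
    }'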
Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?
>
> Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> roy at karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

I just integrated a performance improvement for dedup which will dramatically help when the dedup table does not fit in memory. For more details, take a look at:

6938089 dedup-induced latency causes FC initiator logouts/FC port resets

This will improve performance for tasks such as rm-ing files in a dedup-enabled dataset and destroying a dedup-enabled dataset. It's still a best practice to size your system so that the dedup table can stay resident in the ARC or L2ARC.

- George
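[To check how large the DDT on an existing, already-deduped pool actually is when sizing ARC/L2ARC, zdb can print per-pool dedup statistics. A hedged sketch; the pool name tank is a placeholder and the exact output format can vary between builds:]

    # summary of DDT entries and their on-disk / in-core sizes
    zdb -D tank

    # same, plus a refcount histogram like the one earlier in this thread
    zdb -DD tank

    # the pool-wide dedup ratio is also exposed as a pool property
    zpool get dedupratio tank

[Multiplying the reported entry count by the in-core entry size gives the amount of ARC/L2ARC needed to keep the whole table resident.]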