Hi all,

I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.

Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?

Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.

I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry is for a slab, which can vary in size (512 bytes to 128k). So there's no good generic formula for X bytes of RAM per Y TB of space. You can compute a rough guess if you know what kind of data and general usage pattern the pool will see (basically, you need to take a stab at how big you think the average slab size is). Also, remember that if you have a /very/ good dedup ratio, you will have a smaller DDT for a given pool size than a pool with a poor dedup ratio. Unfortunately, there's no magic bullet, though if you can dig up Richard's post, you should be able to take a guess and not be off by more than 2x or so.

Also, remember you only need to hold the DDT in L2ARC, not in actual RAM, so buy that SSD, young man!

As far as failures go, I can't speak to that specifically. Do realize, though, that not having sufficient L2ARC/RAM to hold the DDT means you spend an awful amount of time reading pool metadata, which really hurts performance (not to mention that it can cripple deletes of any sort...).

> Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?

Honestly, I don't see it being much different over the last couple of builds. The limitations are still there, but given those, I've found it works well.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
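[For anyone following the "buy that SSD" advice: attaching an SSD as an L2ARC cache device is a single zpool command. A minimal sketch, assuming a pool named tank and a spare SSD at c2t0d0 (both names are placeholders, not from the posts above):]

    # add the SSD as a cache (L2ARC) device to the pool
    zpool add tank cache c2t0d0

    # verify that the cache vdev shows up under the pool
    zpool status tank

[The cache device can be removed again with "zpool remove tank c2t0d0" if needed; losing an L2ARC device is not fatal to the pool.]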
Hi,

It's getting better; I believe it's no longer single-threaded after build 135
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6922161), but we're still waiting for a major bug fix:
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6924824
It should be fixed before the release, AFAIK.

Yours
Markus Kovero
----- "Erik Trimble" <erik.trimble at oracle.com> wrote:
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry

150 bytes per block, IIRC, but still, it'd be nice to have this in the official ZFS docs. Let's hope it is added soon.

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Erik Trimble wrote:
> Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> I think it was Richard, a month or so ago, who had a good post about how much space a dedup table entry takes (it was in a discussion where I asked about it). I can't remember what it was (a hundred bytes?) per DDT entry, but one has to remember that each entry is for a slab, which can vary in size (512 bytes to 128k). So there's no good generic formula for X bytes of RAM per Y TB of space. You can compute a rough guess if you know what kind of data and general usage pattern the pool will see (basically, you need to take a stab at how big you think the average slab size is). Also, remember that if you have a /very/ good dedup ratio, you will have a smaller DDT for a given pool size than a pool with a poor dedup ratio. Unfortunately, there's no magic bullet, though if you can dig up Richard's post, you should be able to take a guess and not be off by more than 2x or so.
>
> Also, remember you only need to hold the DDT in L2ARC, not in actual RAM, so buy that SSD, young man!
>
> As far as failures go, I can't speak to that specifically. Do realize, though, that not having sufficient L2ARC/RAM to hold the DDT means you spend an awful amount of time reading pool metadata, which really hurts performance (not to mention that it can cripple deletes of any sort...).

Here's Richard Elling's post in the "dedup and memory/l2arc requirements" thread, where he presents a worst-case upper bound on DDT size:
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-April/039516.html

------start of copy------

You can estimate the amount of disk space needed for the deduplication table and the expected deduplication ratio by using "zdb -S poolname" on your existing pool. Be patient; for an existing pool with lots of objects, this can take some time to run.

# ptime zdb -S zwimming
Simulated DDT histogram:

bucket              allocated                       referenced
______   ______________________________   ______________________________
refcnt   blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE
------   ------   -----   -----   -----   ------   -----   -----   -----
     1    2.27M    239G    188G    194G    2.27M    239G    188G    194G
     2     327K   34.3G   27.8G   28.1G     698K   73.3G   59.2G   59.9G
     4    30.1K   2.91G   2.10G   2.11G     152K   14.9G   10.6G   10.6G
     8    7.73K    691M    529M    529M    74.5K   6.25G   4.79G   4.80G
    16      673   43.7M   25.8M   25.9M    13.1K    822M    492M    494M
    32      197   12.3M   7.02M   7.03M    7.66K    480M    269M    270M
    64       47   1.27M    626K    626K    3.86K    103M   51.2M   51.2M
   128       22    908K    250K    251K    3.71K    150M   40.3M   40.3M
   256        7    302K     48K   53.7K    2.27K   88.6M   17.3M   19.5M
   512        4    131K   7.50K   7.75K    2.74K    102M   5.62M   5.79M
    2K        1       2K      2K      2K   3.23K   6.47M   6.47M   6.47M
    8K        1     128K      5K      5K   13.9K   1.74G   69.5M   69.5M
 Total    2.63M    277G    218G    225G    3.22M    337G    263G    270G

dedup = 1.20, compress = 1.28, copies = 1.03, dedup * compress / copies = 1.50

real     8:02.391932786
user     1:24.231855093
sys        15.193256108

In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:

in-core size = 2.63M * 250 = 657.5 MB

If your dedup ratio is 1.0, then this number will scale linearly with size.
If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
-- richard

------end of copy------
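[To turn that "Total" line into the worst-case in-core figure without doing the arithmetic by hand, something along these lines should work. This is only a sketch: it assumes the Total line format shown above, with block counts scaled by a binary K/M/G suffix, and reuses the example pool name zwimming.]

    zdb -S zwimming | awk '
      $1 == "Total" {
        n = $2
        # scale the block count by its suffix (zdb prints 1024-based units)
        if      (n ~ /K$/) { sub(/K$/, "", n); n *= 1024 }
        else if (n ~ /M$/) { sub(/M$/, "", n); n *= 1024 * 1024 }
        else if (n ~ /G$/) { sub(/G$/, "", n); n *= 1024 * 1024 * 1024 }
        printf "allocated blocks:        %.0f\n", n
        # ~250 bytes of in-core DDT per allocated block, worst case
        printf "worst-case in-core DDT:  %.1f MB\n", n * 250 / (1024 * 1024)
      }'

[For the histogram above this prints roughly 2.76 million blocks and the same 657.5 MB that Richard computed by hand.]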
----- "Haudy Kazemi" <kaze0010 at umn.edu> wrote:
> In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:
>
> in-core size = 2.63M * 250 = 657.5 MB
>
> If your dedup ratio is 1.0, then this number will scale linearly with size.
> If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.

How large was this filesystem?

Are there any good ways of planning memory or SSDs for this?

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
Roy Sigurd Karlsbakk wrote:
> ----- "Haudy Kazemi" <kaze0010 at umn.edu> wrote:
>> In this file system, 2.75 million blocks are allocated. The in-core size of a DDT entry is approximately 250 bytes. So the math is pretty simple:
>>
>> in-core size = 2.63M * 250 = 657.5 MB
>>
>> If your dedup ratio is 1.0, then this number will scale linearly with size.
>> If the dedup ratio is > 1.0, then this number will not scale linearly; it will be less. So you can use the linear scale as a worst-case approximation.
>
> How large was this filesystem?
>
> Are there any good ways of planning memory or SSDs for this?
>
> roy

If you mean figuring out how big memory should be BEFORE you write any data, you need to guesstimate the average block size of the files you will store in the zpool, which is highly data-dependent. In general, ZFS writes a file of size X using a block size Y, where Y is a power of 2 and the smallest size such that X <= Y, up to a maximum of Y = 128k. So look at your (potential) data and consider how big the files are.

The DDT requirement for RAM/L2ARC is:

250 bytes * number of blocks

So, let's say I'm considering a 1TB pool where I think I'm going to store 200GB of MP3s, 200GB of source code, 200GB of miscellaneous Office docs, and 200GB of JPEG image files from my 8-megapixel camera (you don't want the pool more than 80% full!).

Assumed block sizes, and thus number of blocks, for each:

Data          Block size   # blocks per 200GB
MP3           128k         ~1.6 million
Source code   1k           ~200 million
Office docs   32k          ~6.5 million
Pictures      4k           ~52 million

Thus, the total number of blocks you'll need is ~260 million, and the DDT size is:

260 million * 250 bytes = 65GB

Note that the source code takes up 20% of the space but requires 80% of the DDT entries.

Given that the above is the worst case for that file mix (actual dedup/compression will lower the total block count), I would use it for the maximum L2ARC size you want. RAM sizing depends on the size of your *active* working set of files; I'd want enough RAM to cache both all my writes and my most commonly read files at once.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
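[The worksheet above is easy to redo for a different data mix. A rough sketch of the same arithmetic, using the hypothetical sizes and assumed block sizes from Erik's example (the exact output differs slightly from his 65GB figure because his block counts are rounded):]

    awk 'BEGIN {
      gb = 1024 * 1024 * 1024

      # assumed average block size per data class (hypothetical mix from above)
      blocks["MP3 (128k)"]        = (200 * gb) / (128 * 1024)
      blocks["source code (1k)"]  = (200 * gb) / (1 * 1024)
      blocks["office docs (32k)"] = (200 * gb) / (32 * 1024)
      blocks["pictures (4k)"]     = (200 * gb) / (4 * 1024)

      total = 0
      for (c in blocks) {
        printf "%-20s %14.0f blocks\n", c, blocks[c]
        total += blocks[c]
      }
      printf "%-20s %14.0f blocks\n", "total", total

      # ~250 bytes of in-core DDT per unique block, worst case (no dedup hits)
      printf "worst-case DDT size: %.1f GiB\n", total * 250 / gb
    }'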
Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I've been doing a lot of testing with dedup and have concluded it's not really ready for production. If something fails, it can render the pool unusable for hours or maybe days, perhaps due to single-threaded code in ZFS. There is also very little information in the docs (beyond what I've picked up on this list) about how much memory one should have for deduping an x TiB dataset.
>
> Does anyone know what the status of dedup is now? In build 134 it doesn't work very well, but is it better in ON140 and later?
>
> Best regards
>
> roy
> --
> Roy Sigurd Karlsbakk
> (+47) 97542685
> roy at karlsbakk.net
> http://blogg.karlsbakk.net/
> --
> In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all educators to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

I just integrated a performance improvement for dedup which will dramatically help when the dedup table does not fit in memory. For more details, take a look at:

6938089 dedup-induced latency causes FC initiator logouts/FC port resets

This will improve performance for tasks such as rm-ing files in a dedup-enabled dataset and destroying a dedup-enabled dataset. It's still a best practice to size your system so that the dedup table can stay resident in the ARC or L2ARC.

- George
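[To check how large the DDT on an existing, already-deduped pool actually is when sizing ARC/L2ARC, zdb can print per-pool dedup statistics. A hedged sketch; the pool name tank is a placeholder and the exact output format can vary between builds:]

    # summary of DDT entries and their on-disk / in-core sizes
    zdb -D tank

    # same, plus a refcount histogram like the one earlier in this thread
    zdb -DD tank

    # the pool-wide dedup ratio is also exposed as a pool property
    zpool get dedupratio tank

[Multiplying the reported entry count by the in-core entry size gives the amount of ARC/L2ARC needed to keep the whole table resident.]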