Whenever somebody asks, "How much memory do I need to dedup an X-terabyte filesystem?", the standard answer is "as much as you can afford to buy." This is true and correct, but I don't believe it's the best we can do, because "as much as you can buy" is a true assessment for memory in *any* situation. To improve knowledge in this area, I think the question just needs to be asked differently: "How much *extra* memory is required for X terabytes, with dedup enabled versus disabled?" I hope somebody knows more about this than me.

I expect the answer will be something like this: The default ZFS block size is 128K. If you have a filesystem with 128G used, that means you are consuming 1,048,576 blocks, each of which must be checksummed. ZFS uses adler32 and sha256, which means 4 bytes and 32 bytes ... 36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by enabling dedup.

I suspect my numbers are off, because 36 Mbytes seems impossibly small. But I hope some sort of similar (and more correct) logic will apply. ;-)
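A quick back-of-the-envelope sketch of the arithmetic above, in Python. This is a hypothetical illustration only: the 36-bytes-per-block figure is the guess being questioned, and the replies that follow put the real per-entry cost roughly an order of magnitude higher.

KIB = 1024
GIB = 1024 ** 3

def naive_ddt_bytes(used_bytes, block_size=128 * KIB, bytes_per_block=36):
    # 36 bytes/block = 4 (adler32-style) + 32 (sha256) checksum bytes,
    # i.e. the assumption made in the message above.
    blocks = used_bytes // block_size
    return blocks * bytes_per_block

print(naive_ddt_bytes(128 * GIB) / (1024 ** 2), "MiB")  # -> 36.0 MiB for 128G at 128K blocks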
On Fri, Jul 9, 2010 at 5:00 PM, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> The default ZFS block size is 128K. If you have a filesystem with 128G
> used, that means you are consuming 1,048,576 blocks, each of which must be
> checksummed. ZFS uses adler32 and sha256, which means 4 bytes and 32 bytes
> ... 36 bytes * 1M blocks = an extra 36 Mbytes and some fluff consumed by
> enabling dedup.
>
> I suspect my numbers are off, because 36 Mbytes seems impossibly small. But
> I hope some sort of similar (and more correct) logic will apply. ;-)

I think that DDT entries are a little bigger than what you're using. The size seems to range between 150 and 250 bytes depending on how it's calculated, so call it 200 bytes each. Your 128G dataset would require closer to 200M (+/- 25%) for the DDT if your data was completely unique. 1TB of unique data would require 600M - 1000M for the DDT.

The numbers are fuzzy of course, and assume only 128K blocks. Lots of small files will increase the memory cost of dedup, and using it on a zvol that has the default block size (8K) would require 16 times the memory.

-B

--
Brandon High : bhigh at freaks.com
On 7/9/2010 5:18 PM, Brandon High wrote:
> I think that DDT entries are a little bigger than what you're using.
> The size seems to range between 150 and 250 bytes depending on how it's
> calculated, so call it 200 bytes each. Your 128G dataset would require
> closer to 200M (+/- 25%) for the DDT if your data was completely unique.
> 1TB of unique data would require 600M - 1000M for the DDT.
>
> The numbers are fuzzy of course, and assume only 128K blocks. Lots of
> small files will increase the memory cost of dedup, and using it on a
> zvol that has the default block size (8K) would require 16 times the
> memory.

Go back and read several threads from last month about ZFS/L2ARC memory usage for dedup. In particular, I've been quite specific about how to calculate estimated DDT size. Richard has also been quite good at giving size estimates (as well as explaining how to see current block size usage in a filesystem).

The structure in question is this one: ddt_entry

http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/sys/ddt.h#108

I'd have to fire up an IDE to track down the sizes of all the ddt_entry structure's members, but I feel comfortable using Richard's 270 bytes-per-entry estimate.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On 07/09/10 19:40, Erik Trimble wrote:
> I'd have to fire up an IDE to track down the sizes of all the ddt_entry
> structure's members, but I feel comfortable using Richard's 270
> bytes-per-entry estimate.

It must have grown a bit, because on 64-bit x86 a ddt_entry is currently 0x178 = 376 bytes:

# mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic cpu_ms.AuthenticAMD.15 uppc pcplusmp scsi_vhci zfs sata sd ip hook neti sockfs arp usba fctl random cpc fcip nfs lofs ufs logindmux ptm sppp ipc ]
> ::sizeof struct ddt_entry
sizeof (struct ddt_entry) = 0x178
On Fri, Jul 9, 2010 at 5:18 PM, Brandon High <bhigh at freaks.com> wrote:
> I think that DDT entries are a little bigger than what you're using. The
> size seems to range between 150 and 250 bytes depending on how it's
> calculated, so call it 200 bytes each. Your 128G dataset would require
> closer to 200M (+/- 25%) for the DDT if your data was completely unique.
> 1TB of unique data would require 600M - 1000M for the DDT.

Using 376 bytes per entry, it's 376M for 128G of unique data, or just under 3GB for 1TB of unique data.

A 1TB zvol with 8k blocks would require almost 24GB of memory to hold the DDT. Ouch.

-B

--
Brandon High : bhigh at freaks.com
On Jul 9, 2010, at 11:10 PM, Brandon High wrote:
> Using 376 bytes per entry, it's 376M for 128G of unique data, or just
> under 3GB for 1TB of unique data.

4% seems to be a pretty good SWAG.

> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold
> the DDT. Ouch.

... or more than 300GB for 512-byte records.

The performance issue is that DDT access tends to be random. This implies that if you don't have a lot of RAM and your pool has poor random read I/O performance, then you will not be impressed with dedup performance. In other words, trying to dedup lots of data on a small DRAM machine using big, slow pool HDDs will not set any benchmark records. By contrast, using SSDs for the pool can demonstrate good random read performance. As the price per bit of HDDs continues to drop, the value of deduping pools using HDDs also drops.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
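For comparison, the same estimate redone with the 376-bytes-per-entry figure from the mdb output above, assuming every block is unique (the worst case for dedup). Note that at 376 bytes per entry, a 1TB zvol of 8K blocks works out closer to 48GB than the 24GB quoted above, and the per-entry overhead only approaches the 4% figure for small (roughly 8K) blocks.

GIB = 1024 ** 3
TIB = 1024 ** 4
ENTRY_BYTES = 376  # sizeof(struct ddt_entry) per the mdb output above

def ddt_estimate(used_bytes, block_size):
    # Assumes fully unique data: one DDT entry per block.
    return (used_bytes // block_size) * ENTRY_BYTES

for used, bs, label in [
    (128 * GIB, 128 * 1024, "128G at 128K"),
    (1 * TIB,   128 * 1024, "1T   at 128K"),
    (1 * TIB,     8 * 1024, "1T   at 8K  "),
    (1 * TIB,          512, "1T   at 512B"),
]:
    size = ddt_estimate(used, bs)
    print(f"{label}: ~{size / GIB:6.1f} GiB DDT  ({ENTRY_BYTES / bs:.2%} of data size)")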
On 7/10/2010 5:24 AM, Richard Elling wrote:
> The performance issue is that DDT access tends to be random. This implies that
> if you don't have a lot of RAM and your pool has poor random read I/O performance,
> then you will not be impressed with dedup performance. In other words, trying to
> dedup lots of data on a small DRAM machine using big, slow pool HDDs will not set
> any benchmark records. By contrast, using SSDs for the pool can demonstrate good
> random read performance. As the price per bit of HDDs continues to drop, the value
> of deduping pools using HDDs also drops.
> -- richard

Which brings up an interesting idea: if I have a pool with good random I/O (perhaps made from SSDs, or even one of those nifty Oracle F5100 things), I would probably not want to have a DDT created, or at least have one that was very significantly abbreviated. What capability does ZFS have for recognizing that we won't need a full DDT created for high-I/O-speed pools? Particularly given that such pools would almost certainly be heavy candidates for dedup (the $/GB being significantly higher than for other media, and thus space being at a premium)?

I'm not up on exactly how the DDT gets built and referenced to understand how this might happen. But I can certainly see it as being useful to tell ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't really needed.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
On Jul 10, 2010, at 5:33 AM, Erik Trimble wrote:
> Which brings up an interesting idea: if I have a pool with good random I/O
> (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),
> I would probably not want to have a DDT created, or at least have one that
> was very significantly abbreviated. What capability does ZFS have for
> recognizing that we won't need a full DDT created for high-I/O-speed pools?
> Particularly given that such pools would almost certainly be heavy candidates
> for dedup (the $/GB being significantly higher than for other media, and
> thus space being at a premium)?

Methinks it is impossible to build a complete DDT, we'll run out of atoms... maybe if we can use strings? :-)

Think of it as a very, very sparse array. Otherwise it is managed just like other metadata.
 -- richard

--
Richard Elling
richard at nexenta.com   +1-760-896-4422
ZFS and NexentaStor training, Rotterdam, July 13-15, 2010
http://nexenta-rotterdam.eventbrite.com/
On Sat, Jul 10, 2010 at 5:33 AM, Erik Trimble <erik.trimble at oracle.com> wrote:
> Which brings up an interesting idea: if I have a pool with good random I/O
> (perhaps made from SSDs, or even one of those nifty Oracle F5100 things),
> I would probably not want to have a DDT created, or at least have one that
> was very significantly abbreviated. What capability does ZFS have for
> recognizing that we won't need a full DDT created for high-I/O-speed pools?
> Particularly given that such pools would almost certainly be heavy candidates
> for dedup (the $/GB being significantly higher than for other media, and
> thus space being at a premium)?

I'm not exactly sure what problem you're trying to solve. Dedup is to save space, not accelerate i/o. While the DDT is pool-wide, only data that's added to datasets with dedup enabled will create entries in the DDT. If there's data that you don't want to dedup, then don't add it to a pool with dedup enabled.

> I'm not up on exactly how the DDT gets built and referenced to understand
> how this might happen. But I can certainly see it as being useful to tell
> ZFS (perhaps through a pool property?) that building an in-ARC DDT isn't
> really needed.

The DDT is in the pool, not in the ARC. Because it's frequently accessed, some / most of it will reside in the ARC.

-B

--
Brandon High : bhigh at freaks.com
> From: Richard Elling [mailto:richard at nexenta.com]
>
> 4% seems to be a pretty good SWAG.

Is the above "4%" wrong, or am I wrong?

Suppose 200 bytes to 400 bytes per 128 Kbyte block ...
200/131072 = 0.0015 = 0.15%
400/131072 = 0.003 = 0.3%
which would mean for 100G unique data = 153M to 312M ram.

Around 3G ram for 1TB unique data, assuming the default 128K block size.

Next question:

Correct me if I'm wrong: if you have a lot of duplicated data, then dedup increases the probability of an arc/ram cache hit. So dedup allows you to stretch your disk, and also stretch your ram cache. Which also benefits performance.
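One way to reconcile the two sets of numbers: the DDT overhead as a percentage of the data depends entirely on the block size, so the ~0.3% figure (128K blocks) and the ~4% SWAG (roughly 8K blocks) can both be reasonable. A small sketch, assuming 376 bytes per entry (smaller per-entry estimates scale linearly):

ENTRY_BYTES = 376  # assumed per-entry cost, from earlier in the thread

for bs_kib in (128, 64, 32, 16, 8, 4):
    block_size = bs_kib * 1024
    print(f"{bs_kib:>3}K blocks: DDT is ~{ENTRY_BYTES / block_size:.2%} of unique data")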
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Brandon High
>
> Dedup is to save space, not accelerate i/o.

I'm going to have to disagree with you there. Dedup is a type of compression. Compression can be used for storage savings, and/or acceleration. Fast and lightweight compression algorithms (lzop, v.42bis, v.44) are usually used in-line for acceleration, while compute-expensive algorithms (bzip2, lzma, gzip) are usually used for space savings and rarely for acceleration (except when transmitting data across a slow channel).

Most general-purpose lossless compression algorithms (and certainly most of the ones I just mentioned) achieve compression by reducing duplicated data. There are special-purpose lossless formats (flac etc) and lossy ones (jpg, mp3 etc) which use other techniques. But general-purpose compression might possibly even be exclusively algorithms for reduction of repeated data.

Unless I'm somehow mistaken, the performance benefit of dedup comes from the fact that it increases cache hits. Instead of having to read a thousand duplicate blocks from different sectors of disks, you read it once, and the other 999 have all been stored "same as" the original block, so it's 999 cache hits and unnecessary to read disk again.
> > 4% seems to be a pretty good SWAG.
>
> Is the above "4%" wrong, or am I wrong?
>
> Suppose 200 bytes to 400 bytes per 128 Kbyte block ...
> 200/131072 = 0.0015 = 0.15%
> 400/131072 = 0.003 = 0.3%
> which would mean for 100G unique data = 153M to 312M ram.
>
> Around 3G ram for 1TB unique data, assuming the default 128K block size.

Recordsize means maximum block size. Smaller files will be stored in smaller blocks. With lots of files of different sizes, the block size will generally be smaller than the recordsize set for ZFS.

> Next question:
>
> Correct me if I'm wrong: if you have a lot of duplicated data, then dedup
> increases the probability of an arc/ram cache hit. So dedup allows you to
> stretch your disk, and also stretch your ram cache. Which also benefits
> performance.

Theoretically, yes, but there will be an overhead in cpu/memory that can reduce this benefit to a penalty.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of foreign origin. In most cases, adequate and relevant synonyms exist in Norwegian.
> From: Roy Sigurd Karlsbakk [mailto:roy at karlsbakk.net]
> > increases the probability of an arc/ram cache hit. So dedup allows you to
> > stretch your disk, and also stretch your ram cache. Which also benefits
> > performance.
>
> Theoretically, yes, but there will be an overhead in cpu/memory that
> can reduce this benefit to a penalty.

That's why a really fast compression algorithm is used in-line, in hopes that the time cost of compression is smaller than the performance gain of compression. Take for example v.42bis and v.44, which were used to accelerate 56K modems. (Probably still are, if you actually have a modem somewhere. ;-)

Nowadays we have faster communication channels; in fact, when talking about dedup we're talking about local disk speed, which is really fast. But we also have fast processors, and the algorithm in question can be really fast.

I recently benchmarked lzop, gzip, bzip2, and lzma for some important data on our fileserver that I would call "typical." No matter what I did, lzop was so ridiculously lightweight that I could never get lzop up to 100% cpu. Even reading data 100% from cache and filtering through lzop to /dev/null, the kernel overhead of reading ram cache was higher than the cpu overhead to compress.

For the data in question, lzop compressed to 70%, gzip compressed to 42%, bzip2 to 32%, and lzma something like 16%. bzip2 was the slowest (by a factor of 4). lzma -1 and gzip --fast were closely matched in speed but not compression. So the compression of lzop was really weak for the data in question, but it contributed no significant cpu overhead. The point is: it's absolutely possible to compress quickly, if you have a fast algorithm, and gain performance. I'm boldly assuming dedup performs this fast. It would be nice to actually measure and prove it.
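A rough sketch of that kind of comparison using only Python's bundled codecs, with zlib at its fastest level standing in for a very light compressor like lzop (which is not in the standard library). The synthetic sample data and resulting ratios are illustrative only; substitute real files for a meaningful test.

import bz2, lzma, time, zlib

def bench(name, fn, data):
    # Time one compression pass and report ratio and throughput.
    t0 = time.perf_counter()
    out = fn(data)
    dt = time.perf_counter() - t0
    print(f"{name:10s} ratio={len(out)/len(data):6.1%}  {len(data)/dt/1e6:8.1f} MB/s")

if __name__ == "__main__":
    # Mildly compressible sample data; real fileserver data will behave differently.
    data = (b"some fileserver-ish text with repetition " * 4000) + bytes(range(256)) * 500
    bench("zlib -1", lambda d: zlib.compress(d, 1), data)
    bench("zlib -9", lambda d: zlib.compress(d, 9), data)
    bench("bzip2", bz2.compress, data)
    bench("lzma", lzma.compress, data)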
Even the most expensive decompression algorithms generally run significantly faster than I/O to disk -- at least when real disks are involved. So, as long as you don't run out of CPU and have to wait for CPU to be available for decompression, the decompression will win. The same concept is true for dedup, although I don't necessarily think of dedup as a form of compression (others might reasonably do so though.)

 - Garrett
On 7/10/2010 10:14 AM, Brandon High wrote:
> I'm not exactly sure what problem you're trying to solve. Dedup is to
> save space, not accelerate i/o. While the DDT is pool-wide, only data
> that's added to datasets with dedup enabled will create entries in the
> DDT. If there's data that you don't want to dedup, then don't add it to
> a pool with dedup enabled.

What I'm talking about here is that caching the DDT in the ARC takes a non-trivial amount of space (as we've discovered). For a pool consisting of backing store with access times very close to that of main memory, there's no real benefit from caching it in the ARC/L2ARC, so it would be useful if the DDT was simply kept somewhere on the actual backing store, and there was some way to tell ZFS to look there exclusively, and not try to build/store a DDT in ARC.

> The DDT is in the pool, not in the ARC. Because it's frequently accessed,
> some / most of it will reside in the ARC.

Are you sure? I was under the impression that the DDT had to be built from info in the pool, but that what we call the DDT only exists in the ARC. That's my understanding from reading the ddt.h and ddt.c files - that the 'ddt_entry' and 'ddt' structures exist in RAM/ARC/L2ARC, but not on disk. Those two are built using the 'ddt_key' and 'ddt_bookmark' structures on disk. Am I missing something?

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
> Even the most expensive decompression algorithms generally run
> significantly faster than I/O to disk -- at least when real disks are
> involved. So, as long as you don't run out of CPU and have to wait for
> CPU to be available for decompression, the decompression will win. The
> same concept is true for dedup, although I don't necessarily think of
> dedup as a form of compression (others might reasonably do so though.)

Effectively, dedup is a form of compression of the filesystem rather than of any single file, but one oriented to not interfering with access to any of what may be sharing blocks.

I would imagine that if it's read-mostly, it's a win, but otherwise it costs more than it saves. Even with conventional compression, compressing tends to be more resource-intensive than decompressing...

What I'm wondering is when dedup is a better value than compression. Most obviously, when there are a lot of identical blocks across different files; but I'm not sure how often that happens, aside from maybe blocks of zeros (which may well be sparse anyway).
--
This message posted from opensolaris.org
On 7/18/2010 4:18 PM, Richard L. Hamilton wrote:
> Effectively, dedup is a form of compression of the filesystem rather than
> of any single file, but one oriented to not interfering with access to any
> of what may be sharing blocks.
>
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...
>
> What I'm wondering is when dedup is a better value than compression.
> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

From my own experience, a dedup "win" is much more data-usage-dependent than compression. Compression seems to be of general use across the vast majority of data I've encountered - with the sole big exception of media file servers (where the data is already compressed pictures, audio, or video). It seems to be of general utility, since I've always got spare CPU cycles, and it's really not very "expensive" in terms of CPU in most cases. Of course, the *value* of compression varies according to the data (i.e. how much it will compress), but that doesn't matter for *utility* for the most part.

Dedup, on the other hand, currently has a very steep price in terms of needed ARC/L2ARC/RAM, so it's much harder to justify in those cases where it only provides modest benefits. Additionally, we're still in the development stage of dedup (IMHO), so I can't really make a full evaluation of the dedup concept, as many of its issues today are implementation-related, not concept-related.

All that said, dedup has a showcase use case where it is of *massive* benefit: hosting virtual machines. For a machine hosting only 100 VM data stores, I can see 99% space savings. And I see a significant performance boost, since I can cache that one VM image in RAM easily. There are other places where dedup seems modestly useful these days (one is the afore-mentioned media file server, where you'd be surprised how much duplication there is), but it's *much* harder to pre-determine dedup's utility for a given dataset unless you have highly detailed knowledge of that dataset's composition.

I'll admit to not being a big fan of the dedup concept originally (go back a couple of years here on this list), but, given that the world is marching straight to virtualization as fast as we can go, I'm a convert now.

From my perspective, here's a couple of things that I think would help improve dedup's utility for me:

(a) fix the outstanding issues in the current implementation (duh!).

(b) add the ability to store the entire DDT in the backing store, and not have to construct it in ARC from disk-resident info (this would be of great help where backing store = SSD or RAM-based things).

(c) be able to "test-dedup" a given filesystem:
I'd like ZFS to be able to look at a filesystem and tell me how much dedup I'd get out of it, WITHOUT having to actually create a dedup-enabled filesystem and copy the data to it. While it would be nice to be able to simply turn on dedup for a filesystem and have ZFS dedup the existing data there (in place, without copying), I realize the implementation is hard given how things currently work, and frankly, that's of much lower priority for me than being able to test-dedup a dataset (a rough userland approximation of this is sketched below).

(d) increase the slab (record) size significantly, to at least 1MB or more. I daresay the primary way VM images are stored these days is as single, large files (though iSCSI volumes are coming up fast), and as such, I've got 20G files which would really, really benefit from having a much larger slab size.

(e) and, of course, seeing if there's some way we can cut down on dedup's piggy DDT size. :-)

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
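Along the lines of the "test-dedup" wish in (c), a userland approximation is possible today: hash fixed-size chunks of every file under a tree and compare unique versus total chunks. This is only a sketch and only roughly approximates what ZFS would do (real block boundaries, variable record sizes, sparse files, zvols, and metadata are all ignored).

import hashlib, os, sys

def dedup_estimate(root, chunk_size=128 * 1024):
    # Hash fixed 128K chunks, the way a 128K-recordsize filesystem roughly would.
    total = unique = 0
    seen = set()
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as f:
                    while chunk := f.read(chunk_size):
                        total += 1
                        digest = hashlib.sha256(chunk).digest()
                        if digest not in seen:
                            seen.add(digest)
                            unique += 1
            except OSError:
                continue  # skip unreadable files
    return total, unique

if __name__ == "__main__":
    total, unique = dedup_estimate(sys.argv[1] if len(sys.argv) > 1 else ".")
    if total:
        print(f"{total} chunks, {unique} unique, dedup ratio ~{total / unique:.2f}x")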
On Sun, 2010-07-18 at 16:18 -0700, Richard L. Hamilton wrote:
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...
>
> What I'm wondering is when dedup is a better value than compression.
> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

Shared/identical blocks come into play in several specific scenarios:

1) Multiple VMs, cloud. If you have multiple guest OSes installed, they're going to benefit heavily from dedup. Even Zones can benefit here.

2) Situations with lots of copies of large amounts of data where only some of the data is different between each copy. The classic example is a Solaris build server, hosting dozens or even hundreds of copies of the Solaris tree, each being worked on by different developers. Typically the developer is working on something less than 1% of the total source code, so the other 99% can be shared via dedup.

For general purpose usage, e.g. hosting your music or movie collection, I doubt that dedup offers any real advantage. If I were talking about deploying dedup, I'd only use it in situations like the two I mentioned, and not for just a general purpose storage server. For general purpose applications I think compression is better. (Though I think dedup will have higher savings -- significantly so -- in the particular situation where you know you have lots and lots of duplicate/redundant data.)

Note also that dedup actually does some things where your duplicated data may gain an effective increase in redundancy/security, because it makes sure that the data that is deduped has higher redundancy than non-deduped data. (This sounds counterintuitive, but as long as you have at least 3 copies of the duplicated data, it's a net win.)

Btw, compression on top of dedup may actually kill your benefit of dedup. My hypothesis (unproven, admittedly) is that because many compression algorithms cause small permutations of the data to significantly change the bit values in the overall compressed object (even just by changing their offset in the binary), they can seriously defeat dedup's efficacy.

 - Garrett
Brandon High wrote:
> Using 376 bytes per entry, it's 376M for 128G of unique data, or just
> under 3GB for 1TB of unique data.
>
> A 1TB zvol with 8k blocks would require almost 24GB of memory to hold
> the DDT. Ouch.
>
> -B

To reduce RAM requirements, consider an offline or idle-time dedup. I suggested a variation of this in regard to compression a while ago, probably on this list.

In either case, you have the system write the data whichever way is fastest. If there is enough unused CPU power, run maximum compression, otherwise use fast compression. If new data-type-specific compression algorithms are added, attempt compression with those as well (e.g. lossless JPEG recompression that can save 20-25% space). Store the block in whichever compression format works best. If there is enough RAM to maintain a live dedup table, dedup right away. If CPU and RAM pressures are too high, defer dedup and compression to a periodic scrub (or some other new periodically run command).

In the deferred case, the dedup table entries could be generated as blocks are filled/changed and then kept on disk. Periodically that table would be quicksorted by the hash, and then any duplicates would be found next to each other. The blocks for the duplicates would be looked up, verified as truly identical, and then re-written (probably also using BP rewrite). Quicksort is parallelizable, and sorting a multi-gigabyte table is a plausible operation, even on disk: quicksort 100MB pieces of it in RAM and iterate until the whole table ends up sorted.

The end result of all this idle-time compression and deduping is that the initially allocated storage space becomes the upper bound storage requirement, and the data will end up packing tighter over time. The phrasing on bulk packaged items comes to mind: "Contents may have settled during shipping."

Now a theoretical question about dedup: what about the interaction with defragmentation (this also probably needs BP rewrite)? The first file will be completely defragmented, but a second file that is a slight variation of the first will have at least two fragments (the deduped portion and the unique portion). Probably the performance impact will be minor as long as each fragment is a decent minimum size (multiple MB).
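A toy illustration of the sort-then-scan pass described above: once (checksum, block) records are sorted by checksum, duplicates land next to each other and a single linear sweep finds the candidates. This is done in memory here; the proposal above would sort the on-disk table in pieces the same way.

import hashlib
from itertools import groupby

def find_duplicate_candidates(blocks):
    # blocks: iterable of (block_id, data). Returns lists of block_ids sharing a checksum.
    records = sorted(
        (hashlib.sha256(data).digest(), block_id) for block_id, data in blocks
    )
    groups = []
    for _, grp in groupby(records, key=lambda r: r[0]):
        ids = [block_id for _, block_id in grp]
        if len(ids) > 1:
            groups.append(ids)  # verify byte-for-byte before any rewrite, as above
    return groups

if __name__ == "__main__":
    sample = [(0, b"A" * 8192), (1, b"B" * 8192), (2, b"A" * 8192)]
    print(find_duplicate_candidates(sample))  # -> [[0, 2]]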
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Richard L. Hamilton
>
> I would imagine that if it's read-mostly, it's a win, but otherwise it
> costs more than it saves. Even with conventional compression, compressing
> tends to be more resource-intensive than decompressing...

I would imagine it's *easier* to have a win when it's read-mostly, but the expense of computing checksums is going to be paid either way, with or without dedup. The only extra cost dedup adds is maintaining a hash tree of some kind, to see if some block has already been stored on disk. So ... of course I'm speaking hypothetically and haven't proven it ... I think dedup will accelerate the system in nearly all use cases.

The main exception is when you have highly non-duplicated data. I think the cost of dedup CPU power is tiny, but in the case of highly non-duplicated data, even that little expense is a waste.

> What I'm wondering is when dedup is a better value than compression.

Whenever files have internal repetition, compression will be better. Whenever the repetition crosses file boundaries, dedup will be better.

> Most obviously, when there are a lot of identical blocks across different
> files; but I'm not sure how often that happens, aside from maybe blocks
> of zeros (which may well be sparse anyway).

I think the main value here is when there is more than one copy of some files in the filesystem. For example:

In subversion, there are two copies of every file in your working directory. Every file has a corresponding "base" copy located in the .svn directory. If you have a lot of developers ... software or whatever ... who have all checked out the same project, and they're all working on it in their home directories ... all of those copies get essentially cut down to 1. Combine the developers with subversion ... you would have 2x copies of every file, in every person's home dir = ... a lot of copies of the same files ... all cut down to 1.

You build some package from source code. somefile.c becomes somefile.o, and then the linker takes somefile.o and a bunch of other .o files and mashes them all together to make the "finalproduct" executable file. Well, that executable is just copies of all these .o files mashed together. So again ... cut it all down to 1. And multiply by the number of developers who are all doing the same thing in their home dirs.

Others have mentioned VMs, when VMs are duplicated ... I don't personally duplicate many VMs, so it doesn't matter to me ... but I can see the value for others ...
On 20/07/2010 04:41, Edward Ned Harvey wrote:
> I would imagine it's *easier* to have a win when it's read-mostly, but the
> expense of computing checksums is going to be paid either way, with or
> without dedup. The only extra cost dedup adds is maintaining a hash tree of
> some kind, to see if some block has already been stored on disk. So ... of
> course I'm speaking hypothetically and haven't proven it ... I think dedup
> will accelerate the system in nearly all use cases.
>
> The main exception is when you have highly non-duplicated data. I think
> the cost of dedup CPU power is tiny, but in the case of highly
> non-duplicated data, even that little expense is a waste.

Please note that by default ZFS uses fletcher4 checksums, but dedup currently allows only sha256, which is more CPU intensive.

Also, from a performance point of view, there will be a sudden drop in write performance the moment the DDT can't fit entirely in memory. L2ARC could mitigate the impact, though. Then there will be less memory available for data caching due to the extra memory requirements of the DDT. (However, please note that IIRC the DDT is treated as metadata, and by default there is a limit on the metadata cache size to be no bigger than 20% of the ARC - there is a bug open for it; I haven't checked if it's been fixed yet or not.)

> What I'm wondering is when dedup is a better value than compression.
>
> Whenever files have internal repetition, compression will be better.
> Whenever the repetition crosses file boundaries, dedup will be better.

Not necessarily. Compression in ZFS works only within the scope of a single fs block. So, for example, if you have a large file with most of its blocks identical, dedup should "compress" the file much better than compression would.

Also please note that you can use both compression and dedup at the same time.

--
Robert Milkowski
http://milek.blogspot.com
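A small illustration of the per-block point: compression applied record by record (as ZFS applies it) cannot exploit repetition between records, while dedup collapses identical records no matter what they contain. The figures below are illustrative only.

import hashlib, os, zlib

RECORD = 128 * 1024
# 64 identical but individually incompressible 128K records.
data = os.urandom(RECORD) * 64
records = [data[i:i + RECORD] for i in range(0, len(data), RECORD)]

# Per-record compression, roughly how a per-block compressor sees the file.
compressed = sum(len(zlib.compress(r, 6)) for r in records)
# Block-level dedup: count distinct records by checksum.
unique = len({hashlib.sha256(r).digest() for r in records})

print(f"per-record compression: ~{compressed / len(data):.0%} of original size")
print(f"dedup: {unique} unique record(s) out of {len(records)}")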