thr3ads.net - zfs discuss - [zfs-discuss] compressratio vs. dedupratio [Dec 2009]

Robert Milkowski

2009-Dec-12 22:06 UTC

[zfs-discuss] compressratio vs. dedupratio

Hi,

The compressratio property seems to be a ratio of compression for a 
given dataset calculated in such a way so all data in it (compressed or 
not) is taken into account.
The dedupratio property on the other hand seems to be taking into 
account only dedupped data in a pool.
So for example if there is already 1TB of data before dedup=on and then 
dedup is set to on and 3 small identical files are copied in the 
dedupratio will be 3. IMHO it is misleading as it suggests that on 
average a ratio of 3 was achieved in a pool which is not true.

Is it by design or is it a bug?
If it is by design then having another property which would give a 
ratio of dedup in relation to all data in a pool (dedupped or not) would 
be useful.
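
To make the hypothetical above concrete, here is a small sketch of the arithmetic. It assumes that dedupratio is the referenced size of the deduplicated blocks divided by the space they actually occupy, and that a "diluted" ratio divides all referenced data by all allocated data. The function names and the 1 MiB file size are illustrative only and are not taken from ZFS.

```shell
#!/bin/sh
# Illustrative arithmetic only; nothing here queries a real pool.
# Arguments: PLAIN  = bytes written before dedup=on (stored once, not deduped)
#            UNIQUE = bytes of one copy of the deduplicated file
#            COPIES = number of identical copies written with dedup=on

# Ratio over deduplicated data only (what dedupratio appears to report).
dedup_only_ratio() {
    awk -v u="$2" -v c="$3" 'BEGIN { printf "%.2f\n", (u * c) / u }'
}

# Ratio over all data in the pool, deduplicated or not ("diluted").
diluted_ratio() {
    awk -v p="$1" -v u="$2" -v c="$3" \
        'BEGIN { printf "%.2f\n", (p + u * c) / (p + u) }'
}

# 1 TiB of existing data, then three identical 1 MiB files.
dedup_only_ratio 1099511627776 1048576 3   # prints 3.00
diluted_ratio    1099511627776 1048576 3   # prints 1.00
```

Under these assumptions the same pool reports 3.00 by one measure and 1.00 by the other, which is the discrepancy in question.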


Example (snv 129):


milek@r600:/rpool/tmp# mkfile 200m file1
milek@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1

milek@r600:/rpool/tmp# ls -l /var/adm/messages
-rw-r--r-- 1 root root 70993 2009-12-12 21:50 /var/adm/messages
milek@r600:/rpool/tmp# cp /var/adm/messages /test/
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.00x  -


milek@r600:/rpool/tmp# zfs set compression=gzip test
milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.27x  -


milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.24x  -





milek@r600:/rpool/tmp# zpool destroy test
milek@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -


milek@r600:/rpool/tmp# cp /var/adm/messages /test/
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -

milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -
milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.2
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  2.00x  -

milek@r600:/rpool/tmp# rm /test/messages
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  2.00x  -

-- 
Robert Milkowski
http://milek.blogspot.com

Jeff Bonwick

2009-Dec-13 11:44 UTC

[zfs-discuss] compressratio vs. dedupratio

It is by design.  The idea is to report the dedup ratio for the data
you've actually attempted to dedup.  To get a 'diluted' dedup ratio
of the sort you describe, just compare the space used by all datasets
to the space allocated in the pool.  For example, on my desktop,
I have a pool called 'builds' with dedup enabled on some datasets:

$ zfs get used builds
NAME    PROPERTY  VALUE  SOURCE
builds  used      81.0G  -
$ zpool get allocated builds
NAME    PROPERTY   VALUE  SOURCE
builds  allocated  47.4G  -

Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.

Jeff
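
The division above can be scripted. The sketch below takes the two sizes as arguments so that it runs without a pool; the function name is made up for illustration. On a system whose zfs and zpool commands support parsable output (assumed here, and not checked against snv 129: "zfs get -Hp -o value used POOL" and "zpool get -Hp -o value allocated POOL"), the two numbers could be fed in directly.

```shell
#!/bin/sh
# pool_diluted_ratio USED ALLOCATED
# USED      = space used by all datasets   (e.g. from "zfs get used")
# ALLOCATED = space allocated in the pool  (e.g. from "zpool get allocated")
# Both must be in the same unit. The function name is hypothetical.
pool_diluted_ratio() {
    awk -v used="$1" -v alloc="$2" 'BEGIN {
        # Guard against an empty pool to avoid dividing by zero.
        if (alloc <= 0) { print "n/a"; exit }
        printf "%.2f\n", used / alloc
    }'
}

# Values from the "builds" pool above, in gigabytes.
pool_diluted_ratio 81.0 47.4   # prints 1.71
```

Note that a ratio obtained this way reflects everything that makes "used" and "allocated" differ, not deduplication alone.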

Robert Milkowski

2009-Dec-13 14:52 UTC

[zfs-discuss] compressratio vs. dedupratio

Thank you.
However, I think this should be stated more clearly in zpool(1M), perhaps 
even referring to compressratio and explaining that dedupratio is 
different, plus information, as shown below, on how to get a dedup ratio 
that is similar in meaning to compressratio.



On 13/12/2009 11:44, Jeff Bonwick wrote:
> It is by design.  The idea is to report the dedup ratio for the data
> you've actually attempted to dedup.  To get a 'diluted' dedup ratio
> of the sort you describe, just compare the space used by all datasets
> to the space allocated in the pool.  For example, on my desktop,
> I have a pool called 'builds' with dedup enabled on some datasets:
>
> $ zfs get used builds
> NAME    PROPERTY  VALUE  SOURCE
> builds  used      81.0G  -
> $ zpool get allocated builds
> NAME    PROPERTY   VALUE  SOURCE
> builds  allocated  47.4G  -
>
> Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.
>
> Jeff

Craig S. Bell

2009-Dec-14 21:54 UTC

[zfs-discuss] compressratio vs. dedupratio

I am also accustomed to seeing diluted properties such as compressratio.  IMHO
it could be useful (or perhaps just familiar) to see a diluted dedup ratio for
the pool, or maybe see the size / percentage of data used to arrive at
dedupratio.

As Jeff points out, there is enough data available to calculate this.  Would it
be meaningful enough to present a diluted ratio property?  IOW, would that tell
me anything that I don't get from simply using "available" as my fuel gauge?

This is probably a larger topic:  What additional statistics would be genuinely
useful to the admin when there is space interaction between datasets?  As we
have seen, some commands are less objective with dedup:

http://www.c0t0d0s0.org/index.php?url=archives/6168-df-considered-problematic.html
http://blogs.sun.com/jsavit/entry/deduplication_now_in_zfs

Thanks everyone...   -cheers, CSB
-- 
This message posted from opensolaris.org

Mike Gerdts

2009-Dec-15 00:39 UTC

[zfs-discuss] compressratio vs. dedupratio

On Mon, Dec 14, 2009 at 3:54 PM, Craig S. Bell <cbell at standard.com> wrote:
> I am also accustomed to seeing diluted properties such as compressratio.
> IMHO it could be useful (or perhaps just familiar) to see a diluted dedup
> ratio for the pool, or maybe see the size / percentage of data used to
> arrive at dedupratio.
>
> As Jeff points out, there is enough data available to calculate this.
> Would it be meaningful enough to present a diluted ratio property?  IOW,
> would that tell me anything that I don't get from simply using
> "available" as my fuel gauge?
>
> This is probably a larger topic:  What additional statistics would be
> genuinely useful to the admin when there is space interaction between
> datasets?  As we have seen, some commands are less objective with dedup:

I was recently confused when doing mkfile (or was it dd if=/dev/zero
...) and found that even though blocks were compressed away to
nothing, the compressratio did not increase.  For example:

# perl -e 'print "a" x 100000000' > /test/a
# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  7.87x  -

However if I put null characters into the same file:

# dd if=/dev/zero of=a bs=100000000 count=1
1+0 records in
1+0 records out
# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.00x  -

I understand that a block is not allocated if it contains all zeros,
but that would seem to contribute to a higher compressratio rather
than a lower compressratio.

If I disable compression and enable dedup, does it count deduplicated
blocks of zeros toward the dedupratio?

-- 
Mike Gerdts
http://mgerdts.blogspot.com/

Craig S. Bell

2009-Dec-15 08:31 UTC

[zfs-discuss] compressratio vs. dedupratio

Mike, I believe that ZFS treats runs of zeros as holes in a sparse file, rather
than as regular data.  So they aren't really present to be counted for
compressratio.

http://blogs.sun.com/bonwick/entry/seek_hole_and_seek_data
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/017565.html

Mike Gerdts

2009-Dec-15 14:56 UTC

[zfs-discuss] compressratio vs. dedupratio

On Tue, Dec 15, 2009 at 2:31 AM, Craig S. Bell <cbell at standard.com> wrote:
> Mike, I believe that ZFS treats runs of zeros as holes in a sparse file,
> rather than as regular data.  So they aren't really present to be
> counted for compressratio.
>
> http://blogs.sun.com/bonwick/entry/seek_hole_and_seek_data
> http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/017565.html

But it only does so when compression is enabled, so I would
expect compression to claim this as a win.  Without that,
someone may assume that they aren't getting much benefit from
compression, turn it off, then run into problems down the road because
sparseness that develops in files never turns into free space.

Also, I would expect that:

- If a file is created via a write to every block that it would be
accounted for as non-sparse (regardless of compression=<on|off|...>)
- If a file is sparse because the program that created the file used
seek() or similar to skip past blocks, it should be accounted for as
sparse (regardless of compression).
- If a program overwrites a block of a file with zeros, the file should
not be considered sparse.

In the example below, I would expect that writing 100 MB of '\0' would
contribute as much to compressratio as 100 MB of 'a'.  Notice that a
block of zeros does not turn into a sparse file with compression=off.

# zfs create test/on
# zfs create test/off
# zfs set compression=off test/off

# zfs get compression test/on test/off
NAME      PROPERTY     VALUE     SOURCE
test/off  compression  off       local
test/on   compression  on        inherited from test

# mkfile 100m on/100m off/100m

# ls -l o*/100m
-rw------T   1 root     root     104857600 Dec 15 14:27 off/100m
-rw------T   1 root     root     104857600 Dec 15 14:27 on/100m

# du -h o*/100m
 100M   off/100m
   0K   on/100m

# perl -e 'print "a" x 100000000' > on/a
# perl -e 'print "a" x 100000000' > off/a
# sync

# ls -l */a
-rw-r--r--   1 root     root     100000000 Dec 15 14:35 off/a
-rw-r--r--   1 root     root     100000000 Dec 15 14:35 on/a

# du -h */a
  95M   off/a
 3.4M   on/a

# zfs get compressratio test/on test/off
NAME      PROPERTY       VALUE  SOURCE
test/off  compressratio  1.00x  -
test/on   compressratio  28.27x  -

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
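
The accounting discussed in this message can be illustrated with a simplified model. It assumes that compressratio is roughly the logical size of the blocks that are actually allocated divided by their physical size, so blocks stored as holes add nothing to either side. That assumption and the function name are illustrative and are not taken from the ZFS source; the sizes come from the listing above, with the 3.4M du figure converted to bytes.

```shell
#!/bin/sh
# model_ratio LOGICAL_BYTES PHYSICAL_BYTES
# Simplified model: ratio = logical bytes / physical bytes.
model_ratio() {
    awk -v l="$1" -v p="$2" 'BEGIN { printf "%.2f\n", l / p }'
}

A_LOGICAL=100000000      # on/a: 100,000,000 bytes of "a"
A_PHYSICAL=3565158       # on/a: about 3.4M on disk according to du
ZERO_LOGICAL=104857600   # on/100m: 100 MiB of zeros, 0K on disk

# Holes not counted: only on/a contributes.
model_ratio "$A_LOGICAL" "$A_PHYSICAL"

# Holes counted as compressed data: the zeros add logical bytes only.
model_ratio "$((A_LOGICAL + ZERO_LOGICAL))" "$A_PHYSICAL"
```

In this model the first figure lands close to the 28.27x reported for test/on, while the second is roughly twice that, which is the kind of result one would expect if zero-filled blocks were credited to compression.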

Craig S. Bell

2009-Dec-15 19:05 UTC

[zfs-discuss] compressratio vs. dedupratio

I take your point, Mike.  Yes, this seems to be an inconsistency in accounting.
I have simply become accustomed to this (esp. when dealing with virtual disk
images), so I just don't think about it, but it *is* harder to balance
accounts.

For instance, if my guest cleans up its vdisk by writing zeroes over
free space, then I expect "used" in the backing dataset to shrink, but
the compressratio doesn't change very much.  The new zero bytes
effectively go to "available".

This leads back to my larger question -- is it possible to understand more about
what pool space is doing, given the currently available properties?  I think a
couple of additions might be helpful for the admin to visualize what's
going on.

In particular, there should be enough information visible to apply a consistent
view of the pool (either diluted or concentrated) within the same listing.  IMHO
the outermost view should consider all space, including the sparse file holes.

To put it another way, I agree that there should be a way to see where the
zeroes went.  I don't need to see every sparse-file hole, but some
ratio or value should allow me to better estimate how "available" will
change the next time I write.

--------

Inapt analogy follows:

This is like fueling your vehicle.  You don't get perfect information
on quantity (fuel expands when it's hot out), and then you increase
speed, climb a hill, open a window, get stuck in traffic, &c.  How far can
you safely drive?

You don't know exactly when the warning light will come on, but you
watch the needle ("available") at intervals.  This is easy for the
average driver, and most tend not to get stranded, even while observing only one
relative property.

So my guess is that -- with the information currently available -- monitoring a
large zpool could become a bit less precise, and more along the lines of
"is it close to the big E yet? Okay then find a filling station, or a place
to stop".

IMHO enhancements to these properties would help many admins more accurately
understand what will happen next to the available space, so that they
don't exhaust the available space faster than they expected or
intended.  -c
