Hi,

The compressratio property seems to be a ratio of compression for a
given dataset calculated in such a way so all data in it (compressed or
not) is taken into account.
The dedupratio property on the other hand seems to be taking into
account only dedupped data in a pool.
So for example if there is already 1TB of data before dedup=on and then
dedup is set to on and 3 small identical files are copied in the
dedupratio will be 3. IMHO it is misleading as it suggests that on
average a ratio of 3 was achieved in a pool which is not true.

Is it by design or is it a bug?
If it is by design then having another property which would give a
ratio of dedup in relation to all data in a pool (dedupped or not) would
be useful.


Example (snv 129):


milek@r600:/rpool/tmp# mkfile 200m file1
milek@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1

milek@r600:/rpool/tmp# ls -l /var/adm/messages
-rw-r--r-- 1 root root 70993 2009-12-12 21:50 /var/adm/messages
milek@r600:/rpool/tmp# cp /var/adm/messages /test/
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.00x  -


milek@r600:/rpool/tmp# zfs set compression=gzip test
milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.27x  -


milek@r600:/rpool/tmp# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.24x  -


milek@r600:/rpool/tmp# zpool destroy test
milek@r600:/rpool/tmp# zpool create -O atime=off test /rpool/tmp/file1
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -


milek@r600:/rpool/tmp# cp /var/adm/messages /test/
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -

milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.1
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  1.00x  -
milek@r600:/rpool/tmp# cp /var/adm/messages /test/messages.2
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  2.00x  -

milek@r600:/rpool/tmp# rm /test/messages
milek@r600:/rpool/tmp# sync
milek@r600:/rpool/tmp# zpool get dedupratio test
NAME  PROPERTY    VALUE  SOURCE
test  dedupratio  2.00x  -

--
Robert Milkowski
http://milek.blogspot.com
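
The 1TB thought experiment in the message above can be put into numbers with a small model. This is only a sketch under stated assumptions: it reduces the pool to byte counts (data written with dedup off, plus the referenced and allocated sizes of data written with dedup on), whereas ZFS accounts per block. The function names are made up for illustration and are not ZFS interfaces, and the 70993-byte file size is simply borrowed from the transcript.

```python
def concentrated_ratio(dedup_referenced: int, dedup_allocated: int) -> float:
    """Ratio over dedup-enabled data only (the behaviour described for dedupratio)."""
    if dedup_allocated == 0:
        return 1.0
    return dedup_referenced / dedup_allocated


def diluted_ratio(plain_bytes: int, dedup_referenced: int, dedup_allocated: int) -> float:
    """Ratio over all data in the pool, whether it went through dedup or not."""
    total_allocated = plain_bytes + dedup_allocated
    if total_allocated == 0:
        return 1.0
    return (plain_bytes + dedup_referenced) / total_allocated


TB = 1024 ** 4
FILE_SIZE = 70993            # assumption: size of /var/adm/messages in the transcript

plain = 1 * TB               # data written before dedup=on
referenced = 3 * FILE_SIZE   # three identical files, as seen by the datasets
allocated = FILE_SIZE        # one copy actually stored

concentrated = concentrated_ratio(referenced, allocated)
diluted = diluted_ratio(plain, referenced, allocated)

print(f"concentrated: {concentrated:.2f}x")
print(f"diluted:      {diluted:.2f}x")
```

In this model the concentrated figure is 3.00x while the diluted figure is indistinguishable from 1.00x at two decimal places, which is the gap the question is about.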
It is by design.  The idea is to report the dedup ratio for the data
you've actually attempted to dedup.  To get a 'diluted' dedup ratio
of the sort you describe, just compare the space used by all datasets
to the space allocated in the pool.  For example, on my desktop,
I have a pool called 'builds' with dedup enabled on some datasets:

$ zfs get used builds
NAME    PROPERTY  VALUE  SOURCE
builds  used      81.0G  -
$ zpool get allocated builds
NAME    PROPERTY   VALUE  SOURCE
builds  allocated  47.4G  -

Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.

Jeff

On Sat, Dec 12, 2009 at 10:06:49PM +0000, Robert Milkowski wrote:
> [...]
> Is it by design or is it a bug?
> [...]

_______________________________________________
zfs-discuss mailing list
zfs-discuss at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
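
The arithmetic in Jeff's reply can be scripted. The sketch below is not an official tool: it assumes the four-column `NAME PROPERTY VALUE SOURCE` layout shown in the thread, assumes the size suffixes are powers of 1024, and uses helper names (`parse_size`, `property_value`) invented for this example. Because the displayed values are rounded, the result is only as precise as the input.

```python
# Assumption: suffixes as printed by zfs/zpool are binary multiples.
UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3,
         "T": 1024 ** 4, "P": 1024 ** 5}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '81.0G' into bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return float(text[:-1]) * UNITS[suffix]
    return float(text)  # plain byte count without a suffix


def property_value(output: str) -> str:
    """Return the VALUE column from the first data row of 'get' output."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    row = lines[1].split()
    return row[header.index("VALUE")]


# Output copied from the message above.
zfs_used = """\
NAME    PROPERTY  VALUE  SOURCE
builds  used      81.0G  -
"""
zpool_allocated = """\
NAME    PROPERTY   VALUE  SOURCE
builds  allocated  47.4G  -
"""

used = parse_size(property_value(zfs_used))
allocated = parse_size(property_value(zpool_allocated))
ratio = used / allocated

print(f"diluted dedup ratio: {ratio:.2f}x")
```

With the values from the message this reproduces the 1.71 figure. Feeding it live command output instead of the pasted strings is left out on purpose, since the available output options differ between releases.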
Thank you.

However, I think this should be stated more clearly in zpool(1M),
perhaps even referring to compressratio and explaining that dedupratio
is different, plus the information shown below on how to get a dedup
ratio which is similar in meaning to compressratio.


On 13/12/2009 11:44, Jeff Bonwick wrote:
> It is by design.  The idea is to report the dedup ratio for the data
> you've actually attempted to dedup.  To get a 'diluted' dedup ratio
> of the sort you describe, just compare the space used by all datasets
> to the space allocated in the pool.  For example, on my desktop,
> I have a pool called 'builds' with dedup enabled on some datasets:
>
> $ zfs get used builds
> NAME    PROPERTY  VALUE  SOURCE
> builds  used      81.0G  -
> $ zpool get allocated builds
> NAME    PROPERTY   VALUE  SOURCE
> builds  allocated  47.4G  -
>
> Thus my diluted dedup ratio is 81.0 / 47.4 = 1.71.
>
> Jeff
I am also accustomed to seeing diluted properties such as compressratio.  IMHO it could be useful (or perhaps just familiar) to see a diluted dedup ratio for the pool, or maybe see the size / percentage of data used to arrive at dedupratio.

As Jeff points out, there is enough data available to calculate this.  Would it be meaningful enough to present a diluted ratio property?  IOW, would that tell me anything that I don't get from simply using "available" as my fuel gauge?

This is probably a larger topic:  What additional statistics would be genuinely useful to the admin when there is space interaction between datasets.  As we have seen, some commands are less objective with dedup:

http://www.c0t0d0s0.org/index.php?url=archives/6168-df-considered-problematic.html
http://blogs.sun.com/jsavit/entry/deduplication_now_in_zfs

Thanks everyone...  -cheers, CSB
--
This message posted from opensolaris.org
On Mon, Dec 14, 2009 at 3:54 PM, Craig S. Bell <cbell at standard.com> wrote:
> [...]
> This is probably a larger topic:  What additional statistics would be genuinely useful to the admin when there is space interaction between datasets.  As we have seen, some commands are less objective with dedup:

I was recently confused when doing mkfile (or was it dd if=/dev/zero
...) and found that even though blocks were compressed away to
nothing, the compressratio did not increase.  For example:

# perl -e 'print "a" x 100000000' > /test/a
# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  7.87x  -

However if I put null characters into the same file:

# dd if=/dev/zero of=a bs=100000000 count=1
1+0 records in
1+0 records out
# zfs get compressratio test
NAME  PROPERTY       VALUE  SOURCE
test  compressratio  1.00x  -

I understand that a block is not allocated if it contains all zeros,
but that would seem to contribute to a higher compressratio rather
than a lower compressratio.

If I disable compression and enable dedup, does it count deduplicated
blocks of zeros toward the dedupratio?

--
Mike Gerdts
http://mgerdts.blogspot.com/
Mike, I believe that ZFS treats runs of zeros as holes in a sparse file, rather than as regular data.  So they aren't really present to be counted for compressratio.

http://blogs.sun.com/bonwick/entry/seek_hole_and_seek_data
http://mail.opensolaris.org/pipermail/zfs-discuss/2008-April/017565.html
--
This message posted from opensolaris.org
On Tue, Dec 15, 2009 at 2:31 AM, Craig S. Bell <cbell at standard.com> wrote:
> Mike, I believe that ZFS treats runs of zeros as holes in a sparse file, rather than as regular data.  So they aren't really present to be counted for compressratio.
> [...]

But it only does so when compression is enabled; as such, I would
expect that compression would claim this as a win.  Without it,
someone may assume that they aren't getting much benefit from
compression, turn it off, then run into problems down the road because
sparseness that develops in files never turns into free space.

Also, I would expect that:

- If a file is created via a write to every block, it would be
  accounted for as non-sparse (regardless of compression=<on|off|...>).
- If a file is sparse because the program that created the file used
  seek() or similar to skip past blocks, it should be accounted for as
  sparse (regardless of compression).
- If a program overwrites a block of a file with zeros, the file
  should not be considered sparse.

In the below example, I would expect that writing 100MB of '\0' would
contribute as much to compressratio as 100 MB of 'a'.  Notice that a
block of zeros does not turn into a sparse file with compression=off.

# zfs create test/on
# zfs create test/off
# zfs set compression=off test/off
# zfs get compression test/on test/off
NAME      PROPERTY     VALUE  SOURCE
test/off  compression  off    local
test/on   compression  on     inherited from test
# mkfile 100m on/100m off/100m
# ls -l o*/100m
-rw------T 1 root root 104857600 Dec 15 14:27 off/100m
-rw------T 1 root root 104857600 Dec 15 14:27 on/100m
# du -h o*/100m
100M off/100m
0K on/100m
# perl -e 'print "a" x 100000000' > on/a
# perl -e 'print "a" x 100000000' > off/a
# sync
# ls -l */a
-rw-r--r-- 1 root root 100000000 Dec 15 14:35 off/a
-rw-r--r-- 1 root root 100000000 Dec 15 14:35 on/a
# du -h */a
95M off/a
3.4M on/a
# zfs get compressratio test/on test/off
NAME      PROPERTY       VALUE   SOURCE
test/off  compressratio  1.00x   -
test/on   compressratio  28.27x  -

--
Mike Gerdts
http://mgerdts.blogspot.com/
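
The distinction drawn in the list above (a file with blocks skipped by seek() versus a file whose every block was written with zeros) can be reproduced on any POSIX system with a short sketch. Everything here is illustrative: the helper names are invented, `st_blocks` is assumed to be in 512-byte units (typical, but not guaranteed everywhere), and the allocated sizes depend entirely on the filesystem and its compression setting, so no particular output is claimed.

```python
import os
import tempfile

SIZE = 1024 * 1024  # 1 MiB per test file


def make_seek_sparse(path: str, size: int) -> None:
    """Create a file by seeking past the data region and writing one byte."""
    with open(path, "wb") as f:
        f.seek(size - 1)
        f.write(b"\0")
        f.flush()
        os.fsync(f.fileno())


def make_zero_filled(path: str, size: int) -> None:
    """Create a file by explicitly writing zeros to every byte."""
    with open(path, "wb") as f:
        f.write(b"\0" * size)
        f.flush()
        os.fsync(f.fileno())


def usage(path: str) -> tuple:
    """Return (apparent size, allocated bytes) for a file."""
    st = os.stat(path)
    # Assumption: st_blocks counts 512-byte units, as on Linux and Solaris.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as workdir:
    seek_path = os.path.join(workdir, "seek")
    zero_path = os.path.join(workdir, "zeros")
    make_seek_sparse(seek_path, SIZE)
    make_zero_filled(zero_path, SIZE)
    results = {"seek": usage(seek_path), "zeros": usage(zero_path)}

for name, (apparent, allocated) in results.items():
    print(f"{name:5s} apparent={apparent} allocated={allocated}")
```

Both files report the same apparent size, like the `ls -l` lines in the transcript; the allocated column plays the role of `du`. Running this in datasets with compression on and off is one way to see whether the zero-filled file is stored as holes.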
I take your point, Mike.  Yes, this seems to be an inconsistency in accounting.  I have simply become accustomed to this (esp. when dealing with virtual disk images), so I just don't think about it, but it *is* harder to balance accounts.

For instance, if my guest cleans up its vdisk by writing zeroes over free space, then I expect "used" in the backing dataset to shrink, but the compressratio doesn't change very much.  The new zero bytes effectively go to "available".

This leads back to my larger question -- is it possible to understand more about what pool space is doing, given the currently available properties?  I think a couple of additions might be helpful for the admin to visualize what's going on.

In particular, there should be enough information visible to apply a consistent view of the pool (either diluted or concentrated) within the same listing.  IMHO the outermost view should consider all space, including the sparse file holes.

To put it another way, I agree that there should be a way to see where the zeroes went.  I don't need to see every sparse-file hole, but some ratio or value should allow me to better estimate how "available" will change the next time I write.

--------

Inapt analogy follows:  This is like fueling your vehicle.  You don't get perfect information on quantity (fuel expands when it's hot out), and then you increase speed, climb a hill, open a window, get stuck in traffic, &c.  How far can you safely drive?

You don't know exactly when the warning light will come on, but you watch the needle ("available") at intervals.  This is easy for the average driver, and most tend not to get stranded, even while observing only one relative property.

So my guess is that -- with the information currently available -- monitoring a large zpool could become a bit less precise, and more along the lines of "is it close to the big E yet?  Okay then find a filling station, or a place to stop".

IMHO enhancements to these properties would help many admins more accurately understand what will happen next to the available space, so that they don't exhaust it faster than they expected or intended.  -c
--
This message posted from opensolaris.org