Lutz Schumann
2010-Jul-01 08:33 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
Hello list,

I wanted to test deduplication a little and did an experiment. My question was: can I dedupe infinitely, or is there an upper limit?

For that I did a very basic test:
- I created a ramdisk pool (1 GB)
- enabled dedup, and
- wrote zeros to it (in one single file) until an error was returned.

The size of the pool was 1046 MB. I was able to write 62 GB to it before it said "no space left on device". The block size was 128k, so I was able to write ~507,000 blocks to the pool.

With the device being full, I see the following:
1) zfs list reports that no space is left (AVAIL=0)
2) zpool reports that the dedup factor was ~507,000x
3) zpool also reports that 8.6 MB of space were allocated in the pool (0% used)

So to me it looks like something is broken in ZFS accounting with dedup:
- zpool and zfs free-space reporting do not align
- the real deduplication factor was not 507,000 (that would mean I could have written 507,000 x 1 GB = a lot to the pool)
- calculating 1046 MB / 507,000 = 2.1 KB: somehow, for each 128k block, 2.1 KB of data has been written (assuming zfs list is correct). What is this? Metadata? That would mean approximately 1.6% metadata overhead in ZFS (1/(128k/2.1k)).

I repeated the same thing with a recordsize of 32k. The funny thing is:
- again, 60 GB could be written before "no space left"
- 31 MB of space were allocated in the pool (zpool list)

The version of the pool is 25.

During the experiment I could nicely see:
- that performance on ramdisk is CPU bound, doing ~125 MB/sec per core
- that performance scales linearly with added CPU cores (125 MB/s for 1 core, 253 MB/s for 2 cores, 408 MB/s for 4 cores)
- that the upper size of the deduplication table is blocks x ~150 bytes, independent of the dedup factor
- that the DDT does not grow for deduplicatable blocks (zdb -D)
- that performance drops by a factor of ~4 when the allocation policy switches from "closest" to "best fit" as the pool fills (the rate drops from 250 MB/s to 67 MB/s). I suspect even worse results for spinning media because of the head movements (>10x slowdown).

Does anyone know why the dedup factor is wrong? Any insight into what has actually been written (compressed metadata, deduped metadata, etc.) would be greatly appreciated.

Regards,
Robert
--
This message posted from opensolaris.org
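[For anyone wanting to repeat this, a minimal sketch of that kind of setup on OpenSolaris; the device, pool, and file names below are assumptions, not taken from the post:

  # 1 GB ramdisk as the backing device, dedup enabled on the pool
  ramdiskadm -a rdisk0 1g
  zpool create ramtank /dev/ramdisk/rdisk0
  zfs set dedup=on ramtank

  # write zeros into one file until the pool returns ENOSPC
  # ("no space left on device"); default recordsize is 128k
  dd if=/dev/zero of=/ramtank/zeros bs=128k

  # compare the two views of space, plus the dedup table statistics
  zfs list ramtank      # AVAIL as the filesystem sees it
  zpool list ramtank    # ALLOC/FREE and the DEDUP ratio
  zdb -D ramtank        # DDT summary (add more D's for a histogram)]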
Will Murnane
2010-Jul-02 07:14 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
On Thu, Jul 1, 2010 at 04:33, Lutz Schumann <presales at storageconcepts.de> wrote:
> Hello list,
>
> I wanted to test deduplication a little and did an experiment.
>
> My question was: can I dedupe infinitely, or is there an upper limit?
>
> So for that I did a very basic test:
> - I created a ramdisk pool (1 GB)
> - enabled dedup, and
> - wrote zeros to it (in one single file) until an error was returned.

I don't know about the rest of your test, but writing zeroes to a ZFS filesystem is probably not a very good test, because ZFS recognizes these blocks of zeroes and doesn't actually write anything. Unless maybe encryption is on, but maybe not even then.

I'd write a little program that initializes 128k of memory to a particular pattern, then writes it to disk until it gets ENOSPC (or some other error code, I suppose). That should force the first block to actually be written, and all the others to point to it.

> During the experiment I could nicely see:
> - that performance on ramdisk is CPU bound, doing ~125 MB/sec per core

Ouch! That's all?

Will
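[The same idea can also be approximated from the shell instead of a dedicated little program; a rough sketch, with made-up pool and file names, that seeds one 128 KiB block of non-zero data and streams it out until the write fails:

  # one 128 KiB block of (almost certainly unique) non-zero data
  dd if=/dev/urandom of=/tmp/pattern bs=128k count=1

  # repeat that block endlessly and write it into the dedup'd pool;
  # dd stops when the pool returns ENOSPC ("No space left on device")
  while :; do cat /tmp/pattern; done | dd of=/ramtank/pattern.dat bs=128k

Because the file content repeats with a period equal to the 128k recordsize, every record of the file is identical, so with dedup=on only the first one should take real space.]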
Lutz Schumann
2010-Jul-02 16:56 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
Hi,

> I don't know about the rest of your test, but writing zeroes to a ZFS
> filesystem is probably not a very good test, because ZFS recognizes
> these blocks of zeroes and doesn't actually write anything. Unless
> maybe encryption is on, but maybe not even then.

Not true. If I want ZFS to write zeros, ZFS does write zeros. You can simply check this by doing a dd. ZFS does not filter "zero" writes.

>> During the experiment I could nicely see:
>> - that performance on ramdisk is CPU bound, doing ~125 MB/sec per core
>
> Ouch! That's all?

Per core! So for an 8-core system that is ~1 GB/sec - more than most disks can handle. If you use 50% of it (you must save some for scrub), you are OK.

Robert
--
This message posted from opensolaris.org
Darren J Moffat
2010-Jul-02 18:45 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
On 02/07/2010 17:56, Lutz Schumann wrote:
>> I don't know about the rest of your test, but writing zeroes to a ZFS
>> filesystem is probably not a very good test, because ZFS recognizes
>> these blocks of zeroes and doesn't actually write anything. Unless
>> maybe encryption is on, but maybe not even then.
>
> Not true. If I want ZFS to write zeros, ZFS does write zeros. You can simply check this by doing a dd. ZFS does not filter "zero" writes.

Actually it does if you have compression turned on and the blocks compress away to 0 bytes.

See http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#zio_write_bp_init

Specifically line 1005:

1005         if (psize == 0) {
1006                 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
1007         } else {

--
Darren J Moffat
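[A quick way to see the effect Darren describes from the shell; dataset and file names are assumed: write zeros into a compression-enabled dataset and compare logical vs. on-disk size:

  zfs set compression=on ramtank
  zfs set dedup=off ramtank
  dd if=/dev/zero of=/ramtank/zeros.cmp bs=128k count=8192   # ~1 GB of zeros
  sync
  ls -l /ramtank/zeros.cmp   # logical size is ~1 GB
  du -h /ramtank/zeros.cmp   # on-disk size stays tiny: the zero blocks
                             # compress to psize 0 and are never written]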
Lutz Schumann
2010-Jul-03 08:44 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
> Actually it does if you have compression turned on and the blocks
> compress away to 0 bytes.
>
> See
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/zio.c#zio_write_bp_init
>
> Specifically line 1005:
>
> 1005         if (psize == 0) {
> 1006                 zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
> 1007         } else {

Interesting. I did a quick test writing zeros asynchronously:

- dedup=on,  compression=off -> 60 MB/sec
- dedup=off, compression=on  -> 480 MB/sec
- dedup=off, compression=off -> 12 MB/sec

It seems that if I do a sync write with dedup=off, compression=on, I get 12 MB/sec. In both cases I see disk I/O. So to me it looks like metadata writes - the I/O being limited to 480 MB/s seems to be limited by the async metadata updates (this is a VM, so slow).

So does that mean that the zero data blocks are not written, but metadata blocks are?

Robert
--
This message posted from opensolaris.org
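[Roughly, the three async cases above correspond to toggling the dataset properties between runs; dataset and file names are assumed, and note that the settings only affect blocks written after the change:

  zfs set dedup=on ramtank;  zfs set compression=off ramtank   # run 1
  zfs set dedup=off ramtank; zfs set compression=on ramtank    # run 2
  zfs set dedup=off ramtank; zfs set compression=off ramtank   # run 3

  # then for each run:
  dd if=/dev/zero of=/ramtank/zeros bs=128k count=8192 && rm /ramtank/zeros]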
Brandon High
2010-Jul-09 08:45 UTC
[zfs-discuss] dedup accounting anomaly / dedup experiments
On Thu, Jul 1, 2010 at 1:33 AM, Lutz Schumann <presales at storageconcepts.de> wrote:
> Does anyone know why the dedup factor is wrong? Any insight into what has
> actually been written (compressed metadata, deduped metadata, etc.)
> would be greatly appreciated.

Metadata and ditto blocks. Even with dedup, zfs will write multiple copies of blocks after reaching a certain threshold.

-B

--
Brandon High : bhigh at freaks.com
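[The threshold Brandon refers to is presumably the pool-level dedupditto property, which makes ZFS keep an extra physical copy of a deduped block once its reference count passes the configured value; the pool name below is assumed:

  zpool get dedupditto ramtank        # defaults to 0, i.e. no extra copies
  zpool set dedupditto=100 ramtank    # write a 2nd copy once a block has
                                      # 100+ references

Metadata blocks also get their own ditto copies regardless of dedup.]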