I'm running zpool version 23 (via ZFS-FUSE on Linux) and have a zpool
with deduplication turned on.  I am testing how well deduplication
works for storing many similar ISO files, and so far I'm seeing
unexpected results (or perhaps my expectations are wrong).

The ISOs I'm testing with are the 32-bit and 64-bit versions of the
RHEL5 DVD ISOs.  While the two have their differences, they also
contain a lot of similar data.

If I explode both ISO files and copy their contents to my ZFS
filesystem, I see about a 1.24x dedup ratio.  However, if I store only
the ISO files themselves on the ZFS filesystem, the ratio is 1.00x --
no savings at all.

Does this make sense?  I'm going to experiment with other combinations
of ISO files as well...

Thanks,
Ray
On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
> The ISOs I'm testing with are the 32-bit and 64-bit versions of the
> RHEL5 DVD ISOs.  While the two have their differences, they also
> contain a lot of similar data.

Similar != identical.

Dedup works on blocks in ZFS, so unless the ISO files have identical
data aligned at 128k boundaries you won't see any savings.

> If I explode both ISO files and copy their contents to my ZFS
> filesystem, I see about a 1.24x dedup ratio.

Each file starts a new block, so identical files in the two exploded
trees can be deduped against each other.

-B

--
Brandon High : bhigh at freaks.com
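As a quick way to see where a given setup stands, something along these
lines should work (a minimal sketch; the pool name "tank" and the
dataset "tank/isos" are placeholders for your own):

  # Show the block size the dataset writes with and whether dedup is on.
  zfs get recordsize,dedup tank/isos

  # Show the deduplication ratio the pool is actually achieving.
  zpool get dedupratio tank

Note that dedupratio is reported pool-wide, so it reflects every
dedup-enabled dataset in the pool, not just the one being tested.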
On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
> On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
> > The ISOs I'm testing with are the 32-bit and 64-bit versions of the
> > RHEL5 DVD ISOs.  While the two have their differences, they also
> > contain a lot of similar data.
>
> Similar != identical.
>
> Dedup works on blocks in ZFS, so unless the ISO files have identical
> data aligned at 128k boundaries you won't see any savings.
>
> > If I explode both ISO files and copy their contents to my ZFS
> > filesystem, I see about a 1.24x dedup ratio.
>
> Each file starts a new block, so identical files in the two exploded
> trees can be deduped against each other.
>
> -B

Makes sense.  So, as someone else suggested, decreasing my block size
may improve the deduplication ratio.

recordsize, I presume, is the value to tweak?

Thanks,
Ray
> Makes sense.  So, as someone else suggested, decreasing my block size
> may improve the deduplication ratio.
>
> recordsize, I presume, is the value to tweak?

It is, but keep in mind that ZFS needs about 150 bytes of metadata for
each block.  1TB stored in 128k blocks needs about 1GB of memory for
the dedup index to stay in RAM; 64k blocks need double that, and so
on.  An L2ARC will help a lot if memory is low.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
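Roy's rule of thumb can be reproduced with quick shell arithmetic; the
150 bytes per DDT entry is his estimate here, not an exact figure:

  # 1 TiB of unique data in 128k blocks, at ~150 bytes of DDT per block
  echo $(( (1024**4 / (128 * 1024)) * 150 ))
  # 1258291200 bytes, i.e. roughly 1.2 GiB; halving the block size doubles it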
On Fri, Jun 04, 2010 at 12:37:01PM -0700, Ray Van Dolson wrote:
> On Fri, Jun 04, 2010 at 11:16:40AM -0700, Brandon High wrote:
> > On Fri, Jun 4, 2010 at 9:30 AM, Ray Van Dolson <rvandolson at esri.com> wrote:
> > > The ISOs I'm testing with are the 32-bit and 64-bit versions of the
> > > RHEL5 DVD ISOs.  While the two have their differences, they also
> > > contain a lot of similar data.
> >
> > Similar != identical.
> >
> > Dedup works on blocks in ZFS, so unless the ISO files have identical
> > data aligned at 128k boundaries you won't see any savings.
> >
> > > If I explode both ISO files and copy their contents to my ZFS
> > > filesystem, I see about a 1.24x dedup ratio.
> >
> > Each file starts a new block, so identical files in the two exploded
> > trees can be deduped against each other.
> >
> > -B
>
> Makes sense.  So, as someone else suggested, decreasing my block size
> may improve the deduplication ratio.
>
> recordsize, I presume, is the value to tweak?

Yes, but I'd not expect that much commonality between 32-bit and
64-bit Linux ISOs...  Do the same check again with the ISOs
"exploded", as you say.

Nico
--
On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson <rvandolson at esri.com> wrote:
> Makes sense.  So, as someone else suggested, decreasing my block size
> may improve the deduplication ratio.

It might.  It might make your performance tank, too.

Decreasing the block size increases the size of the dedup table (DDT).
Every entry in the DDT uses somewhere around 250-270 bytes.  If the DDT
gets too large to fit in memory, it will have to be read from disk,
which will destroy any sort of write performance (although an L2ARC on
SSD can help).

If you move to 64k blocks, you'll double the DDT size and may not
actually increase your ratio.  Moving to 8k blocks will increase your
DDT by a factor of 16, and still may not help.

Changing the recordsize will not affect files that are already in the
dataset.  You'll have to recopy them to rewrite them with the smaller
block size.

-B

--
Brandon High : bhigh at freaks.com
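A minimal sketch of what that recopy step looks like in practice; the
dataset and file names are placeholders, and the point is only that the
property change applies to blocks written after it is set:

  # Smaller blocks apply only to data written after the change.
  zfs set recordsize=64k tank/isos

  # Files that already exist keep their old block size until rewritten.
  cp /tank/isos/rhel5-x86_64.iso /tank/isos/rhel5-x86_64.iso.new
  mv /tank/isos/rhel5-x86_64.iso.new /tank/isos/rhel5-x86_64.iso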
On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
> On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson <rvandolson at esri.com> wrote:
> > Makes sense.  So, as someone else suggested, decreasing my block size
> > may improve the deduplication ratio.
>
> It might.  It might make your performance tank, too.
>
> Decreasing the block size increases the size of the dedup table (DDT).
> Every entry in the DDT uses somewhere around 250-270 bytes.  If the DDT
> gets too large to fit in memory, it will have to be read from disk,
> which will destroy any sort of write performance (although an L2ARC on
> SSD can help).
>
> If you move to 64k blocks, you'll double the DDT size and may not
> actually increase your ratio.  Moving to 8k blocks will increase your
> DDT by a factor of 16, and still may not help.
>
> Changing the recordsize will not affect files that are already in the
> dataset.  You'll have to recopy them to rewrite them with the smaller
> block size.
>
> -B

Gotcha.  Just trying to make sure I understand how all this works, and
whether I _would_ in fact see an improvement in dedup ratio by tweaking
the recordsize with our data set.

Once we know that, we can decide if it's worth the extra cost in
RAM/L2ARC.

Thanks all.

Ray
On 05.06.10 00:10, Ray Van Dolson wrote:
> On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
>> On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson <rvandolson at esri.com> wrote:
>>> Makes sense.  So, as someone else suggested, decreasing my block size
>>> may improve the deduplication ratio.
>> It might.  It might make your performance tank, too.
>>
>> Decreasing the block size increases the size of the dedup table (DDT).
>> Every entry in the DDT uses somewhere around 250-270 bytes.  If the DDT
>> gets too large to fit in memory, it will have to be read from disk,
>> which will destroy any sort of write performance (although an L2ARC on
>> SSD can help).
>>
>> If you move to 64k blocks, you'll double the DDT size and may not
>> actually increase your ratio.  Moving to 8k blocks will increase your
>> DDT by a factor of 16, and still may not help.
>>
>> Changing the recordsize will not affect files that are already in the
>> dataset.  You'll have to recopy them to rewrite them with the smaller
>> block size.
>>
>> -B
>
> Gotcha.  Just trying to make sure I understand how all this works, and
> whether I _would_ in fact see an improvement in dedup ratio by tweaking
> the recordsize with our data set.
>

You can use zdb -S to assess how effective deduplication would be
without actually turning it on for your pool.

regards
victor
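For reference, the simulation Victor mentions looks like this (assuming
the pool is named tank; it walks the pool's data, so it can take a
while and use a fair amount of memory on a large pool):

  # Print a simulated DDT histogram and the dedup ratio the pool
  # would get if dedup were enabled, without changing anything.
  zdb -S tank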
----- "Brandon High" <bhigh at freaks.com> skrev:
> Decreasing the block size increases the size of the dedup table
> (DDT).
> Every entry in the DDT uses somewhere around 250-270 bytes.

Are you sure it's that high?  I was told it's ~150 bytes per block, or
~1.2GB per terabyte of storage with only 128k blocks.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
On Sun, Jun 6, 2010 at 3:26 AM, Roy Sigurd Karlsbakk <roy at karlsbakk.net> wrote:
> ----- "Brandon High" <bhigh at freaks.com> skrev:
>> Decreasing the block size increases the size of the dedup table
>> (DDT).
>> Every entry in the DDT uses somewhere around 250-270 bytes.
>
> Are you sure it's that high?  I was told it's ~150 bytes per block, or
> ~1.2GB per terabyte of storage with only 128k blocks.

No, but that's the number that stuck in my head.

Even if the DDT entry is smaller, the point I was making doesn't
change.  A smaller record size will improve the dedup ratio at the
cost of a larger DDT.  Once the DDT is too large to fit in the ARC,
performance tanks.

-B

--
Brandon High : bhigh at freaks.com
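If the plan is to lean on an L2ARC as discussed above, adding a cache
device is a one-liner; a sketch, assuming a pool named tank, a
platform that supports cache vdevs, and a Solaris-style device name
(on zfs-fuse/Linux the device path would look more like /dev/sdX):

  # Attach an SSD as an L2ARC cache device so a DDT that outgrows RAM
  # can be read from flash instead of spinning disk.
  zpool add tank cache c1t2d0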
----- "Brandon High" <bhigh at freaks.com> skrev:
> On Sun, Jun 6, 2010 at 3:26 AM, Roy Sigurd Karlsbakk
> <roy at karlsbakk.net> wrote:
> > ----- "Brandon High" <bhigh at freaks.com> skrev:
> >> Decreasing the block size increases the size of the dedup table
> >> (DDT).
> >> Every entry in the DDT uses somewhere around 250-270 bytes.
> >
> > Are you sure it's that high?  I was told it's ~150 bytes per block,
> > or ~1.2GB per terabyte of storage with only 128k blocks.
>
> No, but that's the number that stuck in my head.
>
> Even if the DDT entry is smaller, the point I was making doesn't
> change.  A smaller record size will improve the dedup ratio at the
> cost of a larger DDT.  Once the DDT is too large to fit in the ARC,
> performance tanks.

I cannot but agree with that - if you want to use dedup, (a) wait for
the next release (since there are several rather bad bugs in build 134)
and (b) get a truckload of SSDs for L2ARC to make the system usable.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
On Sun, Jun 6, 2010 at 10:46 AM, Brandon High <bhigh at freaks.com> wrote:
> No, but that's the number that stuck in my head.

Here's a reference from Richard Elling
(http://mail.opensolaris.org/pipermail/zfs-discuss/2010-March/038018.html):

"Around 270 bytes, or one 512 byte sector."

-B

--
Brandon High : bhigh at freaks.com
----- "Brandon High" <bhigh at freaks.com> skrev:
> On Sun, Jun 6, 2010 at 10:46 AM, Brandon High <bhigh at freaks.com> wrote:
> > No, but that's the number that stuck in my head.
>
> Here's a reference from Richard Elling
> (http://mail.opensolaris.org/pipermail/zfs-discuss/2010-March/038018.html):
> "Around 270 bytes, or one 512 byte sector."

I guess this means the rule of thumb needs to change from 1GiB of RAM
per 1TiB deduplicated to 2GiB per 1TiB, and way more with smaller
blocks...

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
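The revised figure works out the same way as the earlier estimate; a
quick check with shell arithmetic (270 bytes per entry is the figure
quoted above, not a guarantee):

  # 1 TiB of unique data in 128k blocks, at ~270 bytes of DDT per block
  echo $(( (1024**4 / (128 * 1024)) * 270 ))
  # 2264924160 bytes, about 2.1 GiB per TiB at 128k, and 16x that at 8k blocks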
----- "Ray Van Dolson" <rvandolson at esri.com> skrev:
> FYI;
>
> With a 4K recordsize, I am seeing a 1.26x dedup ratio between the
> RHEL 5.4 ISO and the RHEL 5.5 ISO file.
>
> However, it took about 33 minutes to copy the 2.9GB ISO file onto the
> filesystem. :)  Definitely would need more RAM in this setup...
>
> Ray

With a 4k recordsize you won't have enough memory slots for all the
RAM you'd need.  Grab a few X25-Ms or something to do the buffering.

Vennlige hilsener / Best regards

roy
--
Roy Sigurd Karlsbakk
(+47) 97542685
roy at karlsbakk.net
http://blogg.karlsbakk.net/
--
I all pedagogikk er det essensielt at pensum presenteres intelligibelt.
Det er et elementært imperativ for alle pedagoger å unngå eksessiv
anvendelse av idiomer med fremmed opprinnelse. I de fleste tilfeller
eksisterer adekvate og relevante synonymer på norsk.
On Fri, Jun 04, 2010 at 01:10:44PM -0700, Ray Van Dolson wrote:
> On Fri, Jun 04, 2010 at 01:03:32PM -0700, Brandon High wrote:
> > On Fri, Jun 4, 2010 at 12:37 PM, Ray Van Dolson <rvandolson at esri.com> wrote:
> > > Makes sense.  So, as someone else suggested, decreasing my block size
> > > may improve the deduplication ratio.
> >
> > It might.  It might make your performance tank, too.
> >
> > Decreasing the block size increases the size of the dedup table (DDT).
> > Every entry in the DDT uses somewhere around 250-270 bytes.  If the DDT
> > gets too large to fit in memory, it will have to be read from disk,
> > which will destroy any sort of write performance (although an L2ARC on
> > SSD can help).
> >
> > If you move to 64k blocks, you'll double the DDT size and may not
> > actually increase your ratio.  Moving to 8k blocks will increase your
> > DDT by a factor of 16, and still may not help.
> >
> > Changing the recordsize will not affect files that are already in the
> > dataset.  You'll have to recopy them to rewrite them with the smaller
> > block size.
> >
> > -B
>
> Gotcha.  Just trying to make sure I understand how all this works, and
> whether I _would_ in fact see an improvement in dedup ratio by tweaking
> the recordsize with our data set.
>
> Once we know that, we can decide if it's worth the extra cost in
> RAM/L2ARC.
>
> Thanks all.

FYI;

With a 4K recordsize, I am seeing a 1.26x dedup ratio between the
RHEL 5.4 ISO and the RHEL 5.5 ISO file.

However, it took about 33 minutes to copy the 2.9GB ISO file onto the
filesystem. :)  Definitely would need more RAM in this setup...

Ray
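For anyone wanting to repeat this kind of test, a minimal sketch of the
workflow; the pool, dataset, and ISO file names are illustrative only:

  # A dedicated dataset with a small recordsize and dedup enabled.
  zfs create -o recordsize=4k -o dedup=on tank/isotest

  # Copy the two similar ISOs in and check the pool-wide dedup ratio.
  cp rhel-server-5.4-x86_64-dvd.iso /tank/isotest/
  cp rhel-server-5.5-x86_64-dvd.iso /tank/isotest/
  zpool get dedupratio tank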