Hi All;

Is there any hope for deduplication on ZFS?

Mertol

Mertol Ozyoney
Storage Practice - Sales Manager
Sun Microsystems, TR
Istanbul TR
Phone +902123352200
Mobile +905339310752
Fax +902123352222
Email mertol.ozyoney at sun.com
Mertol,

Yes, dedup is certainly on our list and has been actively discussed recently, so there's hope and some forward progress. It would be interesting to see where it fits into our customers' priorities for ZFS. We have a long laundry list of projects, and in addition there are bug fixes and performance changes that customers are demanding.

Neil.

Mertol Ozyoney wrote:
> Hi All;
>
> Is there any hope for deduplication on ZFS?
>
> Mertol
A really smart nexus for dedup is right when archiving takes place. For systems like EMC Centera, dedup is basically a byproduct of checksumming. Two files with similar metadata that have the same hash? They're identical.

Charles

On 7/7/08 4:25 PM, "Neil Perrin" <Neil.Perrin at Sun.COM> wrote:
> Yes, dedup is certainly on our list and has been actively
> discussed recently, so there's hope and some forward progress.
Even better would be using the ZFS block checksums (assuming we are only summing the data, not its position or time :)...

Then we could have two files that have 90% the same blocks, and still get some dedup value... ;)

Nathan.

Charles Soto wrote:
> A really smart nexus for dedup is right when archiving takes place. For
> systems like EMC Centera, dedup is basically a byproduct of checksumming.
> Two files with similar metadata that have the same hash? They're identical.
Neil Perrin wrote:
> Yes, dedup is certainly on our list and has been actively
> discussed recently, so there's hope and some forward progress.

I want to cast my vote for getting dedup on ZFS. One place we currently use ZFS is as nearline storage for backup data. I have a 16TB server that provides a file store for an EMC Networker server. I'm seeing a compressratio of 1.73, which is mighty impressive, since we also use native EMC compression during the backups. But with dedup, we should see way more. Here at UCB SSL, we have demoed and investigated various dedup products, hardware and software, but they are all steep on the ROI curve. I would be very excited to see block-level ZFS deduplication roll out, especially since we already have the infrastructure in place using Solaris/ZFS.

Cheers,

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
On Tue, 8 Jul 2008, Nathan Kroenert wrote:
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)

It seems that the hard problem is not whether ZFS has the structure to support it (the implementation seems pretty obvious), but rather that ZFS is supposed to be able to scale to extremely large sizes. If you have a petabyte of storage in the pool, then the data structure to keep track of block similarity could grow exceedingly large. The block checksums are designed to be as random as possible, so their value does not suggest anything regarding the similarity of the data unless the values are identical. The checksums have enough bits and randomness that binary trees would not scale.

Except for the special case of backups or cloned server footprints, it does not seem that data deduplication is going to save the 90% (or more) space that Quantum claims at http://www.quantum.com/Solutions/datadeduplication/Index.aspx.

ZFS clones already provide a form of data deduplication.

The actual benefit of data deduplication to an enterprise seems negligible unless the backup system directly supports it. In the enterprise the cost of storage has more to do with backing up the data than the amount of storage media consumed.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
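To put a rough number on the scaling concern above, here is a back-of-envelope calculation; the per-entry size is an assumption for illustration, not a measured ZFS figure:

# Rough, back-of-envelope estimate of how large a per-block dedup table
# could get for a petabyte-class pool.  The per-entry size is an assumed
# figure (256-bit checksum plus a block address and a reference count),
# not anything measured from ZFS.

POOL_BYTES  = 10**15          # 1 PB of data in the pool
RECORD_SIZE = 128 * 1024      # default ZFS recordsize
ENTRY_BYTES = 32 + 16 + 8     # checksum + block address + refcount (assumed)

blocks = POOL_BYTES // RECORD_SIZE
table  = blocks * ENTRY_BYTES

print("blocks to track: %.1e" % blocks)          # ~7.6e9 blocks
print("table size: %.0f GB" % (table / 1e9))     # ~430 GB
# With smaller (e.g. 8 KB) records the table is 16x larger still --
# far too big to hold in memory, which is the scaling problem above.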
On Mon, 7 Jul 2008, Jonathan Loran wrote:
> use ZFS is as nearline storage for backup data. I have a 16TB server
> that provides a file store for an EMC Networker server. I'm seeing a
> compressratio of 1.73, which is mighty impressive, since we also use
> native EMC compression during the backups. But with dedup, we should
> see way more.

I was going to say something smart about how zfs could contribute to improved serialized compression. However, I retract that and think that when one starts with order, it is best to preserve order and not attempt to re-create order once things have devolved into utter chaos.

This deduplication technology seems similar to the Microsoft ads I see on TV which advertise how their new technology saves the customer 20% of the 500% additional cost incurred by Microsoft's previous technology (which was itself a band-aid to a previous technology).

Sun/Solaris should be about being smarter rather than working harder. If data devolution is a problem, it is most likely that the solution is to investigate the root causes and provide solutions which do not lead to devolution. For example, if Berkeley has 30,000 students who all require a home directory with similar stuff, perhaps they can be initialized using ZFS clones so that there is little waste of space until a student modifies an existing file.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Jul 07, 2008 at 07:56:26PM -0500, Bob Friesenhahn wrote:
>
> This deduplication technology seems similar to the Microsoft ads I
> see on TV which advertise how their new technology saves the customer

Quantum's claim of 20:1 just doesn't jibe in my head, either, for some reason.

-brian
On Mon, Jul 7, 2008 at 7:40 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> The actual benefit of data deduplication to an enterprise seems
> negligible unless the backup system directly supports it. In the
> enterprise the cost of storage has more to do with backing up the data
> than the amount of storage media consumed.

Real data...

I did a survey of about 120 (mostly sparse) zone roots installed over an 18 month period and used for normal enterprise activity. Each zone root is installed into its own SVM soft partition with a strong effort to isolate application data elsewhere. Each zone's /var (including /var/tmp) was included in the survey.

My mechanism involved calculating the md5 checksum of every 4 KB block from the SVM raw device. This size was chosen because it is the fixed block size of the player in the market that does deduplication of live data today.

My results were that I found that I had 75% duplicate data - with no special effort to minimize duplicate data. If other techniques were applied to minimize duplicate data (e.g. periodic write of zeros over free space, extend file system to do the same for freed blocks, mount with noatime, etc.) or full root zones (or LDoms) were the subject of the test I would expect a higher level of duplication.

Supposition...

As I have considered deduplication for application data I see several things happen in various areas.

- Multiple business application areas use very similar software.

  When looking at various applications that directly (conscious choice) or indirectly (embedded in some other product) use various web servers, application servers, databases, etc., each application administrator uses the same installation media to perform an installation into a private (but commonly NFS mounted) area. Many/most of these applications do a full installation of java which is a statistically significant size of the installation.

- Maintenance activity creates duplicate data.

  When patching, upgrading, or otherwise performing maintenance, it is common to make a full copy or a fresh installation of the software. This allows most of the maintenance activity to be performed when the workload is live as well as rapid fallback by making small configuration changes. The vast majority of the data in these multiple versions is identical (e.g. small percentage of jars updated, maybe a bit of the included documentation, etc.)

- Application distribution tools create duplicate data.

  Some application-level clustering technologies cause a significant amount of data to be sent from the administrative server to the various cluster members. By application server design, this is duplicate data. If that data all resides on the same highly redundant storage frame, it could be reduced back down to one (or fewer) copies.

- Multiple development and release trees are duplicate.

  When various developers check out code from a source code repository or a single developer has multiple copies to work on different releases, the checked out code is nearly 100% duplicate, and objects that are created during builds may be highly duplicate.

- Relying on storage-based snapshots and clones is impractical.

  There tend to be organizational walls between those that manage storage and those that consume it. As storage is distributed across a network (NFS, iSCSI, FC) things like delegated datasets and RBAC are of limited practical use.
  Due to these factors and likely others, storage snapshots and clones are only used for the few cases where there is a huge financial incentive with minimal administrative effort. Deduplication could be deployed on the back end to do what clones can't do due to non-technical reasons.

- Clones diverge permanently but shouldn't.

  If I have a 3 GB OS image (inside an 8 GB block device) that I am patching, there is a reasonable chance that I unzip 500 MB of patches to the system, apply the patches, then remove them. If deduplication is done at the block device level (e.g. iSCSI LUNs shared from a storage server) the space "uncloned" by extracting the patches remains per-server used space. Additionally the other space used by the installed patches remains used. Deduplication can reclaim the majority of the space.

--
Mike Gerdts
http://mgerdts.blogspot.com/
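For anyone who wants to repeat this kind of survey on their own zone roots or disk images, a rough sketch of the approach described above might look like the following; the 4 KB block size and md5 match the description, but the script is only an illustration, not the actual tool used for that survey:

#!/usr/bin/env python
# Sketch of a duplicate-block survey over a raw device or disk image:
# read fixed 4 KB blocks, hash each with md5, and report what fraction
# of the blocks are duplicates.  Point it at your own device path or
# image file; this is an illustration, not a production tool.

import hashlib, sys

BLOCK = 4 * 1024    # fixed 4 KB blocks, as in the survey described above

def survey(path):
    counts = {}
    total = 0
    with open(path, 'rb') as dev:
        while True:
            buf = dev.read(BLOCK)
            if not buf:
                break
            total += 1
            digest = hashlib.md5(buf).digest()
            counts[digest] = counts.get(digest, 0) + 1
    unique = len(counts)
    print("%s: %d blocks, %d unique, %.1f%% duplicate"
          % (path, total, unique, 100.0 * (total - unique) / max(total, 1)))

if __name__ == '__main__':
    for p in sys.argv[1:]:
        survey(p)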
I second this, provided we also check that the data is in fact identical. Checksum collisions are likely given the sizes of disks and the sizes of checksums, and some users actually deliberately generate data with colliding checksums (researchers and nefarious users).

Dedup must be absolutely safe, and users should decide if they want the cost of checking blocks versus the space saving.

Maurice

On 08/07/2008, at 10:00 AM, Nathan Kroenert wrote:
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)
>
> Nathan.
Good points. I see the archival process as a good candidate for adding dedup because it is essentially doing what a stage/release archiving system already does - "faking" the existence of data via metadata. Those blocks aren't actually there, but they're still "accessible" because they're *somewhere* the system knows about (i.e. the "other twin").

Currently in SAMFS, if I store two identical files on the archiving filesystem and my policy generates 4 copies, I will have created 8 copies of the file (albeit with different metadata). Dedup would help immensely here. And as archiving (data management) is inherently a "costly" operation, it's used where potentially slower access to data is acceptable.

Another system that comes to mind that utilizes dedup is Xythos WebFS. As Bob points out, keeping track of dupes is a chore. IIRC, WebFS uses a relational database to track this (among much of its other metadata).

Charles

On 7/7/08 7:40 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:
> It seems that the hard problem is not whether ZFS has the structure to
> support it (the implementation seems pretty obvious), but rather that
> ZFS is supposed to be able to scale to extremely large sizes. If you
> have a petabyte of storage in the pool, then the data structure to
> keep track of block similarity could grow exceedingly large.
On Mon, 7 Jul 2008, Mike Gerdts wrote:
>
> As I have considered deduplication for application data I see several
> things happen in various areas.

You have provided an excellent description of gross inefficiencies in the way systems and software are deployed today, resulting in massive duplication. Massive duplication is used to ease service deployment and management. Most of this massive duplication is not technically necessary.

> There tend to be organizational walls between those that manage
> storage and those that consume it. As storage is distributed across
> a network (NFS, iSCSI, FC) things like delegated datasets and RBAC
> are of limited practical use.

It seems that deduplication on the server does not provide much benefit to the client since the client always sees a duplicate. It does not know that it doesn't need to cache or copy a block twice because it is a duplicate. Only the server benefits from the deduplication, except that maybe server-side caching improves and provides the client with a bit more performance.

While deduplication can obviously save server storage space, it does not seem to help much for backups, and it does not really help the user manage all of that data. It does help the user in terms of less raw storage space, but there is surely a substantial run-time cost associated with the deduplication mechanism. None of the existing applications (based on POSIX standards) has any understanding of deduplication, so they won't benefit from it. If you use tar, cpio, or 'cp -r' to copy the contents of a directory tree, they will transmit just as much data as before, and if the destination does real-time deduplication, then the copy will be slower. If the copy is to another server, then the copy time will be huge, just like before.

Unless the backup system fully understands and has access to the filesystem deduplication mechanism, it will be grossly inefficient just like before. Recovery from a backup stored in a sequential (e.g. tape) format which does understand deduplication would be quite interesting indeed.

Raw storage space is cheap. Managing the data is what is expensive. Perhaps deduplication is a response to an issue which should be solved elsewhere?

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Mon, Jul 7, 2008 at 9:24 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> It seems that deduplication on the server does not provide much benefit to
> the client since the client always sees a duplicate. It does not know that
> it doesn't need to cache or copy a block twice because it is a duplicate.
> Only the server benefits from the deduplication except that maybe
> server-side caching improves and provides the client with a bit more
> performance.

I want the deduplication to happen where it can be most efficient. Just like with snapshots and clones, the client will have no idea that multiple metadata sets point to the same data. If deduplication makes it so that each GB of perceived storage is cheaper, clients benefit because the storage provider is (or should be) charging less.

> While deduplication can obviously save server storage space, it does not
> seem to help much for backups, and it does not really help the user manage
> all of that data.

I agree. Follow-on work needs to happen in the backup and especially restore areas. The first phase of work in this area is complete when a full restore of all data (including snapshots and clones) takes the same amount of space as was occupied during backup.

I suspect that if you take a look at the processor utilization on most storage devices you will find that there are lots of times that the processors are relatively idle. Deduplication can happen in real time when the processors are not very busy, but dirty block analysis should be queued during times of high processor utilization. If you find that the processor can't keep up with the deduplication workload, it suggests that your processors aren't fast/plentiful enough or you have deduplication enabled on inappropriate data sets. The same goes for I/O induced by the dedup process.

In another message it was suggested that the size of the checksum employed by zfs is so large that maintaining a database of the checksums would be too costly. It may be that a multi-level checksum scheme is needed. That is, perhaps the database of checksums uses a 32-bit or 64-bit hash of the 256-bit checksum. If a hash collision occurs then normal I/O routines are used for comparing the checksums. If they are also the same, then compare the data. It may be that the intermediate comparison is more overhead than is needed, because one set of data is already in cache and in the worst case an I/O is needed for the checksum or the data.
Why do two I/Os if only one is needed?

> Unless the backup system fully understands and has access to the filesystem
> deduplication mechanism, it will be grossly inefficient just like before.
> Recovery from a backup stored in a sequential (e.g. tape) format which does
> understand deduplication would be quite interesting indeed.

Right now it is a mess. Take a look at the situation for restoring snapshots/clones and you will see that unless you use deduplication during restore you need to go out and buy a lot of storage to do a restore of highly duplicate data.

> Raw storage space is cheap. Managing the data is what is expensive.

The systems that make the raw storage scale to petabytes of fault tolerant storage are very expensive and sometimes quite complex. Needing fewer or smaller spindles should mean less energy consumption, less space, lower MTTR, higher MTTDL, and less complexity in all the hardware used to string it all together.

> Perhaps deduplication is a response to an issue which should be solved
> elsewhere?

Perhaps. However, I take a look at my backup and restore options for zfs today and don't think the POSIX API is the right way to go - at least as I've seen it used so far. Unless something happens that makes restores of clones retain the initial space efficiency or deduplication hides the problem, clones are useless in most environments. If this problem is solved via fixing backups and restores, deduplication seems even more like the next step to take for storage efficiency. If it is solved by adding deduplication then we get the other benefits of deduplication at the same time.

And after typing this message, deduplication is henceforth known as "d11n". :)

--
Mike Gerdts
http://mgerdts.blogspot.com/
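A minimal sketch of the two-level lookup idea described above (a small 64-bit key kept in memory, with the full 256-bit checksum only compared on a candidate hit); the sizes and names here are assumptions for illustration, not ZFS internals:

# Minimal sketch of the two-level scheme: keep only a small (64-bit) key
# per block in memory, and compare the full 256-bit checksum only when
# the small key matches.  Sizes and structures are assumptions for
# illustration, not ZFS internals.

import hashlib

small_index = {}    # 64-bit key -> list of (full checksum, block reference)

def lookup_or_add(data, block_ref):
    """Return an existing block's reference if this data is a probable
    duplicate, otherwise record the new block and return None."""
    full = hashlib.sha256(data).digest()
    key = full[:8]                      # small hash of the big checksum
    for cand_sum, cand_ref in small_index.get(key, []):
        if cand_sum == full:            # worst case: one extra fetch/compare
            return cand_ref             # probable duplicate
    small_index.setdefault(key, []).append((full, block_ref))
    return None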
Oh, I agree. Much of the duplication described is clearly the result of "bad design" in many of our systems. After all, most of an OS can be served off the network (diskless systems etc.). But much of the dupe I'm talking about is less about not using the most efficient system administration tricks. Rather, it's about the fact that software (e.g. Samba) is used by people, and people don't always do things efficiently.

Case in point: students in one of our courses were hitting their quota by growing around 8GB per day. Rather than simply agree that "these kids need more space," we had a look at the files. Turns out just about every student copied a 600MB file into their own directories, as it was created by another student to be used as a "template" for many of their projects. Nobody understood that they could use the file right where it sat. Nope. 7GB of dupe data. And these students are even familiar with our practice of putting "class media" on a read-only share (these files serve as similar "templates" for their own projects - you can create a full video project with just a few MB in your "project file" this way).

So, while much of the situation is caused by "bad data management," there aren't always systems we can employ that prevent it. Done right, dedup can certainly be "worth it" for my operations. Yes, teaching the user the "right thing" is useful, but that user isn't there to know how to "manage data" for my benefit. They're there to learn how to be filmmakers, journalists, speech pathologists, etc.

Charles

On 7/7/08 9:24 PM, "Bob Friesenhahn" <bfriesen at simple.dallas.tx.us> wrote:
> You have provided an excellent description of gross inefficiencies in
> the way systems and software are deployed today, resulting in massive
> duplication. Massive duplication is used to ease service deployment
> and management. Most of this massive duplication is not technically
> necessary.
On Mon, Jul 7, 2008 at 11:07 PM, Charles Soto <csoto at mail.utexas.edu> wrote:
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it. Done right, dedup can
> certainly be "worth it" for my operations.

Well said.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Does anyone know a tool that can look over a dataset and give duplication statistics? I'm not looking for something incredibly efficient, but I'd like to know how much it would actually benefit our dataset: HiRISE has a large set of spacecraft data (images) that could potentially have large amounts of redundancy, or not. Also, other up and coming missions have a large data volume with a lot of duplicate image info and a small budget; with "d11p" in OpenSolaris there is a good business case to invest in Sun/OpenSolaris rather than buy the cheaper storage (+ linux?) that can simply hold everything as is.

If someone feels like coding a tool up that basically makes a file of checksums and counts how many times a particular checksum gets hit over a dataset, I would be willing to run it and provide feedback. :)

-Tim

Charles Soto wrote:
> So, while much of the situation is caused by "bad data management," there
> aren't always systems we can employ that prevent it. Done right, dedup can
> certainly be "worth it" for my operations. Yes, teaching the user the
> "right thing" is useful, but that user isn't there to know how to "manage
> data" for my benefit. They're there to learn how to be filmmakers,
> journalists, speech pathologists, etc.
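A rough sketch of such a tool might look like the following; the 128 KB chunk size (the default ZFS recordsize) and sha256 are assumptions, and actual on-disk block boundaries could make the real dedup ratio differ:

#!/usr/bin/env python
# Sketch of a dedup-estimation tool: walk a directory tree, checksum every
# fixed-size chunk of every file, and report how often checksums repeat.
# The 128 KB chunk size (default ZFS recordsize) and sha256 are assumptions;
# actual on-disk block layout may differ.

import hashlib, os, sys

CHUNK = 128 * 1024

def scan(root):
    hits = {}
    total = 0
    for dirpath, dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, 'rb') as f:
                    while True:
                        buf = f.read(CHUNK)
                        if not buf:
                            break
                        d = hashlib.sha256(buf).digest()
                        hits[d] = hits.get(d, 0) + 1
                        total += 1
            except (IOError, OSError):
                continue                    # skip unreadable files
    unique = len(hits)
    print("%d chunks, %d unique, %.1f%% potentially dedupable"
          % (total, unique, 100.0 * (total - unique) / max(total, 1)))

if __name__ == '__main__':
    scan(sys.argv[1] if len(sys.argv) > 1 else '.')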
> Raw storage space is cheap. Managing the data is what is expensive.

Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time than a scheduled dedup.

> Perhaps deduplication is a response to an issue which should be solved
> elsewhere?

I don't think you can make this generalisation. For most people, yes, but not everyone.

cheers,
--justin
> Does anyone know a tool that can look over a dataset and give
> duplication statistics? I'm not looking for something incredibly
> efficient but I'd like to know how much it would actually benefit our

Check out the following blog:

http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool
Justin Stringfellow wrote:
>> Raw storage space is cheap. Managing the data is what is expensive.
>
> Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have
> stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time
> than a scheduled dedup.
>
>> Perhaps deduplication is a response to an issue which should be solved
>> elsewhere?
>
> I don't think you can make this generalisation. For most people, yes, but not everyone.

Frankly, while I tend to agree with Bob that backend dedup is something that ever-cheaper disks and client-side misuse make unnecessary, I would _very_ much like us to have some mechanism by which we could have some sort of a 'pay-per-feature' system, so people who disagree with me can still get what they want. <grin>

By that, I mean something along the lines of a 'bounty' system where folks pony up cash for features. I'd love to have many more outside (from Sun) contributors to the OpenSolaris base, ZFS in particular. Right now, virtually all the development work is being driven by internal-to-Sun priorities, which, given that Sun pays the developers, is OK. However, I would really like to have some direct method where outsiders can show to Mgmt that there is direct cash for certain improvements.

For Justin, it sounds like being able to pony up several thousand (minimum) for a desired feature would be no problem. And, for the rest of us, I can think that a couple of hundred of us putting up $100 each to get RAIDZ expansion might move it to the front of the TODO list. <wink> Plus, we might be able to attract some more interest from the hobbyist folks that way. :-)

Buying a service contract and then bugging your service rep doesn't say the same thing as "I'm willing to pony up $10k right now for feature X". Big customers have weight to throw around, but we need some mechanism where a mid/small guy can make a real statement, and back it up.

--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
Just going to make a quick comment here. It's a good point about wanting backup software to support this; we're a much smaller company but it's already more difficult to manage the storage needed for backups than our live storage.

However, we're actively planning that over the next 12 months, ZFS will actually *be* our backup system, so for us just ZFS and send/receive supporting de-duplication would be great :) In fact, I can see that being useful for a number of places. ZFS send/receive is already a good way to stream incremental changes and keep filesystems in sync. Having de-duplication built into that can only be a good thing.

PS. Yes, we'll still have off-site tape backups just in case, but the vast majority of our backup & restore functionality (including two off-site backups) will be just ZFS.
> Even better would be using the ZFS block checksums (assuming we are only
> summing the data, not its position or time :)...
>
> Then we could have two files that have 90% the same blocks, and still
> get some dedup value... ;)

Yes, but you will need to add some sort of highly collision-resistant checksum (sha+md5 maybe) and code to (a) bit-level compare blocks on collision (100% bit verification) and (b) handle linked or cascaded collision tables (2+ blocks with the same hash but differing bits).

I actually coded some of this and was playing with it. My testbed relied on another internal data store to track hash maps, collisions (dedup lists) and collision cascades (kind of like what perl does with hash key collisions). It turned out to be a real pain when taking into account snaps and clones. I decided to wait until the resilver/grow/remove code was in place, as this seems to be part of the puzzle.

-Wade
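A toy model of the verify-and-cascade approach described above; sha256 is used here as a stand-in for whatever strong checksum is chosen, and this is illustrative only, not the testbed code mentioned:

# Toy model of verify-and-cascade: blocks are grouped by a strong digest,
# but a byte-for-byte comparison decides whether a new block truly matches
# an existing entry or must be cascaded as a second entry under the same
# digest.

import hashlib

class DedupTable(object):
    def __init__(self):
        self.chains = {}    # digest -> list of entries sharing that digest

    def add(self, data):
        digest = hashlib.sha256(data).digest()
        chain = self.chains.setdefault(digest, [])
        for entry in chain:
            if entry['data'] == data:        # 100% bit verification
                entry['refs'] += 1
                return entry                 # deduped against existing block
        entry = {'data': data, 'refs': 1}    # same digest, different bits:
        chain.append(entry)                  # cascade a new entry
        return entry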
zfs-discuss-bounces at opensolaris.org wrote on 07/08/2008 03:08:26 AM:
> > Does anyone know a tool that can look over a dataset and give
> > duplication statistics? I'm not looking for something incredibly
> > efficient but I'd like to know how much it would actually benefit our
>
> Check out the following blog..:
>
> http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

Just want to add, while this is ok to give you a ballpark dedup number -- fletcher2 is notoriously collision prone on real data sets. It is meant to be fast at the expense of collisions. This issue can show much more dedup possible than really exists on large datasets.
Wade.Stuart at fallon.com wrote:
> Just want to add, while this is ok to give you a ballpark dedup number --
> fletcher2 is notoriously collision prone on real data sets. It is meant to
> be fast at the expense of collisions. This issue can show much more dedup
> possible than really exists on large datasets.

Doing this using sha256 as the checksum algorithm would be much more interesting. I'm going to try that now and see how it compares with fletcher2 for a small contrived test.

--
Darren J Moffat
Justin Stringfellow wrote:
>> Raw storage space is cheap. Managing the data is what is expensive.
>
> Not for my customer. Internal accounting means that the storage team gets paid for each allocated GB on a monthly basis. They have
> stacks of IO bandwidth and CPU cycles to spare outside of their daily busy period. I can't think of a better spend of their time
> than a scheduled dedup.

[donning my managerial accounting hat]

It is not a good idea to design systems based upon someone's managerial accounting whims. These are subject to change in illogical ways at unpredictable intervals. This is why managerial accounting can be so much fun for people who want to hide costs. For example, some bright manager decided that they should charge $100/month/port for ethernet drops. So now, instead of having a centralized, managed network with well defined port mappings, every cube has an el-cheapo ethernet switch. Saving money? Not really, but this can be hidden by the accounting.

In the interim, I think you will find that if the goal is to reduce the number of bits stored on some "expensive storage," there is more than one way to accomplish that goal.

-- richard
On Jul 8, 2008, at 11:00 AM, Richard Elling wrote:
> much fun for people who want to hide costs. For example, some bright
> manager decided that they should charge $100/month/port for ethernet
> drops. So now, instead of having a centralized, managed network with
> well defined port mappings, every cube has an el-cheapo ethernet switch.
> Saving money? Not really, but this can be hidden by the accounting.

Indeed, it actively hurts performance (mixing sunray, mobile, and fixed units on the same subnets rather than segregation by type).

--
Keith H. Bierman   khbkhb at gmail.com | AIM kbiermank
5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749
<speaking for myself*> Copyright 2008
On Tue, 8 Jul 2008, Richard Elling wrote:
> [donning my managerial accounting hat]
> It is not a good idea to design systems based upon someone's managerial
> accounting whims. These are subject to change in illogical ways at
> unpredictable intervals. This is why managerial accounting can be so

Managerial accounting whims can be put to good use. If there is desire to reduce the amount of disk space consumed, then the accounting whims should make sure that those who consume the disk space get to pay for it. Apparently this is not currently the case, or else there would not be so much blatant waste. On the flip-side, the approach which results in so much blatant waste may be extremely profitable, so the waste does not really matter.

Imagine if university students were allowed to use as much space as they wanted but had to pay a per-megabyte charge every two weeks or their account is terminated. This would surely result in a huge reduction in disk space consumption.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Something else came to mind which is a negative regarding deduplication. When zfs writes new sequential files, it should try to allocate blocks in a way which minimizes "fragmentation" (disk seeks). Disk seeks are the bane of existing storage systems since they come out of the available IOPS budget, which is only a couple hundred ops/second per drive. The deduplication algorithm will surely result in increasing effective fragmentation (decreasing sequential performance) since duplicated blocks will result in a seek to the master copy of the block followed by a seek to the next block. Disk seeks will remain an issue until rotating media goes away, which (in spite of popular opinion) is likely quite a while from now.

Someone has to play devil's advocate here. :-)

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
Hmmn, you might want to look at Andrew Tridgell's thesis (yes, Andrew of Samba fame), as he had to solve this very question to be able to select an algorithm to use inside rsync.

--dave

Darren J Moffat wrote:
> Doing this using sha256 as the checksum algorithm would be much more
> interesting. I'm going to try that now and see how it compares with
> fletcher2 for a small contrived test.

--
David Collier-Brown            | Always do right. This will gratify
Sun Microsystems, Toronto      | some people and astonish the rest
davecb at sun.com                 |                      -- Mark Twain
(905) 943-1983, cell: (647) 833-9377, (800) 555-9786 x56583
bridge: (877) 385-4099 code: 506 9191#
zfs-discuss-bounces at opensolaris.org wrote on 07/08/2008 01:26:15 PM:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

Yes, I think it should be close to common sense to realize that you are trading speed for space (but it should be well documented if dedup/squash ever makes it into the codebase). You find these types of tradeoffs in just about every area of disk administration, from the type of raid you select, inode numbers, and block size, to the number of spindles and size of disk you use. The key here is that it would be a choice, just as compression is per fs -- let the administrator choose her path. In some situations it would make sense, in others not.

-Wade

> Someone has to play devil's advocate here. :-)

Debate is welcome, it is the only way to flesh out the issues.
Bob Friesenhahn wrote:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

It should, but because of its copy-on-write nature, fragmentation is a significant part of the ZFS data lifecycle. There was a discussion of this on this list at the beginning of the year...

http://mail.opensolaris.org/pipermail/zfs-discuss/2007-November/044077.html

> Disk seeks are the bane of existing storage systems since they come
> out of the available IOPS budget, which is only a couple hundred
> ops/second per drive. The deduplication algorithm will surely result
> in increasing effective fragmentation (decreasing sequential
> performance) since duplicated blocks will result in a seek to the
> master copy of the block followed by a seek to the next block.

On ZFS, sequential files are rarely sequential anyway. The SPA tries to keep blocks nearby, but when dealing with snapshotted sequential files being rewritten, there is no way to keep everything in order. But if you read through the thread referenced above, you'll see that there's no clear data about just how that impacts performance (I still owe Mr. Elling a filebench run on one of my spare servers).

--Joe
On Tue, 8 Jul 2008, Moore, Joe wrote:
>
> On ZFS, sequential files are rarely sequential anyway. The SPA tries to
> keep blocks nearby, but when dealing with snapshotted sequential files
> being rewritten, there is no way to keep everything in order.

I think that rewriting files (updating existing blocks) is pretty rare. Only limited types of applications do such things. That is a good thing since zfs is not so good at rewriting files. The most common situation is that a new file is written, even if selecting "save" for an existing file in an application. Even if the user thinks that the file is being re-written, usually the application writes to a new temporary file and moves it into place once it is known to be written correctly. The majority of files will be written sequentially and most files will be small enough that zfs will see all the data before it outputs to disk.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer,    http://www.GraphicsMagick.org/
On Tue, Jul 8, 2008 at 12:25 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Managerial accounting whims can be put to good use. If there is
> desire to reduce the amount of disk space consumed, then the accounting
> whims should make sure that those who consume the disk space get to
> pay for it. Apparently this is not currently the case or else there
> would not be so much blatant waste.

The existence of the waste paves the way for new products to come in and offer competitive advantage over in-place solutions. When companies aren't buying anything due to budget constraints, the only way to make sales is to show businesses that by buying something they will save money - and quickly.

> Imagine if university students were allowed to use as much space as
> they wanted but had to pay a per megabyte charge every two weeks or
> their account is terminated? This would surely result in huge
> reduction in disk space consumption.

If you can offer the perception of more storage because efficiencies of the storage devices make it the same cost as less storage, then perhaps allocating more per student is feasible. Or maybe tuition could drop by a few bucks.

--
Mike Gerdts
http://mgerdts.blogspot.com/
On Tue, Jul 8, 2008 at 1:26 PM, Bob Friesenhahn <bfriesen at simple.dallas.tx.us> wrote:
> Something else came to mind which is a negative regarding
> deduplication. When zfs writes new sequential files, it should try to
> allocate blocks in a way which minimizes "fragmentation" (disk seeks).

With L2ARC on SSD, seeks are free and IOPs are quite cheap (compared to spinning rust). Cold reads may be a problem, but there is a reasonable chance that L2ARC sizing can be helpful here.

Also, the blocks that are likely to be duplicate are going to be the same file but just with a different offset. That is, this file is going to be the same in every one of my LDom disk images.

# du -h /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar
 38M   /usr/jdk/instances/jdk1.5.0/jre/lib/rt.jar

There is a pretty good chance that the first copy will be sequential and as a result all of the deduped copies would be sequential as well. What's more - it is quite likely to be in the ARC or L2ARC.

--
Mike Gerdts
http://mgerdts.blogspot.com/
Tim Spriggs wrote:
> Does anyone know a tool that can look over a dataset and give
> duplication statistics? I'm not looking for something incredibly
> efficient but I'd like to know how much it would actually benefit our
> dataset: HiRISE has a large set of spacecraft data (images) that could
> potentially have large amounts of redundancy, or not.
>
> If someone feels like coding a tool up that basically makes a file of
> checksums and counts how many times a particular checksum gets hit over
> a dataset, I would be willing to run it and provide feedback. :)
>
> -Tim

Me too. Our data profile is just like Tim's: terabytes of satellite data. I'm going to guess that the d11p ratio won't be fantastic for us. I sure would like to measure it though.

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
Justin Stringfellow wrote:
>> Does anyone know a tool that can look over a dataset and give
>> duplication statistics? I'm not looking for something incredibly
>> efficient but I'd like to know how much it would actually benefit our
>
> Check out the following blog:
>
> http://blogs.sun.com/erickustarz/entry/how_dedupalicious_is_your_pool

Unfortunately we are on Solaris 10 :( Can I get a zdb for zfs V4 that will dump those checksums?

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
Moore, Joe wrote:
> On ZFS, sequential files are rarely sequential anyway. The SPA tries to
> keep blocks nearby, but when dealing with snapshotted sequential files
> being rewritten, there is no way to keep everything in order.

In some cases, a d11p system could actually speed up data reads and writes. If you are repeatedly accessing duplicate data, then you will more likely hit your ARC, and not have to go to disk. With your data d11p, the ARC can hold a significantly higher percentage of your data set, just like the disks. For a d11p ARC, I would expire based upon block reference count. If a block has few references, it should expire first, and vice versa: blocks with many references should be the last out.

With all the savings on disks, think how much RAM you could buy ;)

Jon

--
Jonathan Loran
IT Manager
Space Sciences Laboratory, UC Berkeley
(510) 643-5146  jloran at ssl.berkeley.edu
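As a sketch of that eviction policy only (the real ARC is far more sophisticated than this toy), a cache could weight eviction by reference count roughly like this:

# Illustrative-only sketch of a cache that evicts the least-referenced
# (then least-recently-used) block first.

import itertools

class RefCountCache(object):
    def __init__(self, capacity):
        self.capacity = capacity
        self.clock = itertools.count()
        self.entries = {}    # block_id -> [refcount, last_use, data]

    def access(self, block_id, data, refcount):
        if block_id not in self.entries and len(self.entries) >= self.capacity:
            # evict the block with the fewest dedup references,
            # breaking ties by least recent use
            victim = min(self.entries,
                         key=lambda b: (self.entries[b][0], self.entries[b][1]))
            del self.entries[victim]
        self.entries[block_id] = [refcount, next(self.clock), data]
        return data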
Mike Gerdts wrote:

[I agree with the comments in this thread, but... I think we're still being old fashioned...]

>> Imagine if university students were allowed to use as much space as
>> they wanted but had to pay a per megabyte charge every two weeks or
>> their account is terminated? This would surely result in huge
>> reduction in disk space consumption.
>
> If you can offer the perception of more storage because of
> efficiencies of the storage devices make it the same cost as less
> storage, then perhaps allocating more per student is feasible. Or
> maybe tuition could drop by a few bucks.

hmm... well, having spent the past two years at the University, I can provide the observation that:

0. Tuition never drops.
1. Everybody (yes everybody) had a laptop. I would say the average hard disk size per laptop was > 100 GBytes.
2. Everybody (yes everybody) had USB flash drives. In part because the school uses them for recruitment tools (give-aways), but they are inexpensive, too.
3. Everybody (yes everybody) had an MP3 player of some magnitude. Many were disk-based, but there were many iPod Nanos, too.
4. > 50% had smart phones -- crackberries, iPhones, etc.
5. The school actually provides some storage space, but I don't know anyone who took advantage of the service. E-mail and document sharing was outsourced to google -- no perceptible shortage of space there. Even Microsoft charges only $3/user/month for exchange and sharepoint services. I think many businesses would be hard-pressed to match that sort of efficiency.

Unlike my undergraduate days, where we had to make trade-offs between beer and floppy disks, there does not seem to be a shortage of storage space amongst the university students today -- in spite of the rise of beer prices recently (hops shortage, they claim ;-O)

Is the era of centralized home directories for students over?

I think that the normal enterprise backup scenarios are more likely to gain from de-dup, in part because they tend to make full backups of systems and end up with zillions of copies of (static) OS files. Actual work files tend to be smaller, for many businesses. De-dup on my desktop seems to be a non-issue.

Has anyone done a full value chain or data path analysis for de-dup? Will de-dup grow beyond the backup function? Will the performance penalty of SHA-256 and bit comparison kill all interactive performance? Should I set aside a few acres at the ranch to grow hops? So many good questions, so little time...

-- richard
> Hi All
> Is there any hope for deduplication on ZFS ?
> Mertol Ozyoney
> Storage Practice - Sales Manager
> Sun Microsystems
> Email mertol.ozyoney at sun.com

There is always hope.

Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software there are a lot of choices of how we could implement this.

SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge one of those with ZFS.

It _could_ be as simple (with SVN as an example) as using directory listings to produce files which were then 'diffed'. You could then view the diffs as though they were changes made to lines of source code.

Just add a "tree" subroutine to allow you to grab all the diffs that referenced changes to file 'xyz' and you would have easy access to all the changes of a particular file (or directory).

With the speed-optimized ability added to use ZFS snapshots with the "tree subroutine" to roll back a single file (or directory), you could undo / redo your way through the filesystem.

Using a LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html) you could "sit out" on the play and watch from the sidelines -- returning to the OS when you thought you were 'safe' (and if not, jumping back out).

Thus, Mertol, it is possible (and could work very well).

Rob
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM:
> There is always hope.
>
> Seriously though, looking at http://en.wikipedia.org/wiki/Comparison_of_revision_control_software
> there are a lot of choices of how we could implement this.
>
> SVN/K, Mercurial and Sun Teamware all come to mind. Simply ;) merge
> one of those with ZFS.
>
> With the speed-optimized ability added to use ZFS snapshots with the
> "tree subroutine" to roll back a single file (or directory), you could
> undo / redo your way through the filesystem.

dedup is not revision control, you seem to completely misunderstand the problem.

> Using a LKCD (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html)
> you could "sit out" on the play and watch from the sidelines --
> returning to the OS when you thought you were 'safe' (and if not,
> jumping back out).

Now it seems you have veered even further off course. What are you implying the LKCD has to do with zfs, solaris, dedup, let alone revision control software?

-Wade
To do dedup properly, it seems like there would have to be some overly complicated methodology for a sort of delayed dedup of the data. For speed, you''d want your writes to go straight into the cache and get flushed out as quickly as possibly, keep everything as ACID as possible. Then, a dedup scrubber would take what was written, do the voodoo magic of checksumming the new data, scanning the tree to see if there are any matches, locking the duplicates, run the usage counters up or down for that block of data, swapping out inodes, and marking the duplicate data as free space. It''s a lofty goal, but one that is doable. I guess this is only necessary if deduplication is done at the file level. If done at the block level, it could possibly be done on the fly, what with the already implemented checksumming at the block level, but then your reads will suffer because pieces of files can potentially be spread all over hell and half of Georgia on the zdevs. Deduplication is going to require the judicious application of hallucinogens and man hours. I expect that someone is up to the task. On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote:> zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > Hi All > > >Is there any hope for deduplication on ZFS ? > > >Mertol Ozyoney > > >Storage Practice - Sales Manager > > >Sun Microsystems > > > Email mertol.ozyoney at sun.com > > > > There is always hope. > > > > Seriously thought, looking at http://en.wikipedia. > > org/wiki/Comparison_of_revision_control_software there are a lot of > > choices of how we could implement this. > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > one of those with ZFS. > > > > It _could_ be as simple (with SVN as an example) of using directory > > listings to produce files which were then ''diffed''. You could then > > view the diffs as though they were changes made to lines of source code. > > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > referenced changes to file ''xyz'' and you would have easy access to > > all the changes of a particular file (or directory). > > > > With the speed optimized ability added to use ZFS snapshots with the > > "tree subroutine" to rollback a single file (or directory) you could > > undo / redo your way through the filesystem. > > > > > dedup is not revision control, you seem to completely misunderstand the > problem. > > > > > Using a LKCD ( > http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html > > ) you could "sit out" on the play and watch from the sidelines -- > > returning to the OS when you thought you were ''safe'' (and if not, > > jumping backout). > > > > Now it seems you have veered even further off course. What are you > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > control software? > > -Wade > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss >-- chris -at- microcozm -dot- net === Si Hoc Legere Scis Nimium Eruditionis Habes -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/cd9db2ef/attachment.html>
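To make the bookkeeping of such a delayed dedup scrubber concrete, here is a
rough sketch in Python. It is purely illustrative: the in-memory block table,
the per-file block lists and the free list are hypothetical stand-ins for the
on-disk structures a real implementation would walk, and sha256 stands in for
whatever checksum the pool actually uses.

import hashlib
from collections import defaultdict

def dedup_scrub(blocks, file_maps):
    """Offline pass.  blocks: block_id -> bytes; file_maps: filename ->
    list of block_ids.  Returns the set of block_ids freed."""
    canonical = {}               # checksum -> block_id kept as the master copy
    refcount = defaultdict(int)  # block_id -> references after the pass
    freed = set()

    for blk_list in file_maps.values():
        for i, bid in enumerate(blk_list):
            csum = hashlib.sha256(blocks[bid]).hexdigest()
            keeper = canonical.get(csum)
            # Verify byte-for-byte before merging, so a checksum collision
            # can never silently corrupt data.
            if keeper is not None and keeper != bid and blocks[keeper] == blocks[bid]:
                blk_list[i] = keeper     # swap the pointer to the master copy
                refcount[keeper] += 1
                freed.add(bid)           # the duplicate block becomes free space
            else:
                canonical[csum] = bid
                refcount[bid] += 1
    return freed

The hard parts in a real filesystem are exactly the ones hand-waved here:
finding every reference to a block, taking the right locks, and doing it all
without stalling the pool.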
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 09:58:53 AM:> To do dedup properly, it seems like there would have to be some > overly complicated methodology for a sort of delayed dedup of the > data. For speed, you''d want your writes to go straight into the > cache and get flushed out as quickly as possibly, keep everything as > ACID as possible. Then, a dedup scrubber would take what was > written, do the voodoo magic of checksumming the new data, scanning > the tree to see if there are any matches, locking the duplicates, > run the usage counters up or down for that block of data, swapping > out inodes, and marking the duplicate data as free space.I agree, but what you are describing is file based dedup, ZFS already has the groundwork for dedup in the system (block level checksuming and pointers).> It''s a > lofty goal, but one that is doable. I guess this is only necessary > if deduplication is done at the file level. If done at the block > level, it could possibly be done on the fly, what with the already > implemented checksumming at the block level,exactly -- that is why it is attractive for ZFS, so much of the groundwork is done and needed for the fs/pool already.> but then your reads > will suffer because pieces of files can potentially be spread all > over hell and half of Georgia on the zdevs.I don''t know that you can make this statement without some study of an actual implementation on real world data -- and then because it is block based, you should see varying degrees of this dedup-flack-frag depending on data/usage. For instance, I would imagine that in many scenarios much od the dedup data blocks would belong to the same or very similar files. In this case the blocks were written as best they could on the first write, the deduped blocks would point to a pretty sequential line o blocks. Now on some files there may be duplicate header or similar portions of data -- these may cause you to jump around the disk; but I do not know how much this would be hit or impact real world usage.> Deduplication is going > to require the judicious application of hallucinogens and man hours. > I expect that someone is up to the task.I would prefer the coder(s) not be seeing "pink elephants" while writing this, but yes it can and will be done. It (I believe) will be easier after the grow/shrink/evac code paths are in place though. Also, the grow/shrink/evac path allows (if it is done right) for other cool things like a base to build a roaming defrag that takes into account snaps, clones, live and the like. I know that some feel that the grow/shrink/evac code is more important for home users, but I think that it is super important for most of these additional features. -Wade> On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote: > zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > Hi All > > >Is there any hope for deduplication on ZFS ? > > >Mertol Ozyoney > > >Storage Practice - Sales Manager > > >Sun Microsystems > > > Email mertol.ozyoney at sun.com > > > > There is always hope. > > > > Seriously thought, looking at http://en.wikipedia. > > org/wiki/Comparison_of_revision_control_software there are a lot of > > choices of how we could implement this. > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > one of those with ZFS. > > > > It _could_ be as simple (with SVN as an example) of using directory > > listings to produce files which were then ''diffed''. 
You could then > > view the diffs as though they were changes made to lines of sourcecode.> > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > referenced changes to file ''xyz'' and you would have easy access to > > all the changes of a particular file (or directory). > > > > With the speed optimized ability added to use ZFS snapshots with the > > "tree subroutine" to rollback a single file (or directory) you could > > undo / redo your way through the filesystem. > > >> dedup is not revision control, you seem to completely misunderstand the > problem. > > > > > Using a LKCD(http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html> > ) you could "sit out" on the play and watch from the sidelines -- > > returning to the OS when you thought you were ''safe'' (and if not, > > jumping backout). > >> Now it seems you have veered even further off course. What are you > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > control software? > > -Wade > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > -- > chris -at- microcozm -dot- net > === Si Hoc Legere Scis Nimium Eruditionis Habes > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com> wrote:> zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 09:58:53 AM: > > > To do dedup properly, it seems like there would have to be some > > overly complicated methodology for a sort of delayed dedup of the > > data. For speed, you''d want your writes to go straight into the > > cache and get flushed out as quickly as possibly, keep everything as > > ACID as possible. Then, a dedup scrubber would take what was > > written, do the voodoo magic of checksumming the new data, scanning > > the tree to see if there are any matches, locking the duplicates, > > run the usage counters up or down for that block of data, swapping > > out inodes, and marking the duplicate data as free space. > I agree, but what you are describing is file based dedup, ZFS already has > the groundwork for dedup in the system (block level checksuming and > pointers). > > > It''s a > > lofty goal, but one that is doable. I guess this is only necessary > > if deduplication is done at the file level. If done at the block > > level, it could possibly be done on the fly, what with the already > > implemented checksumming at the block level, > > exactly -- that is why it is attractive for ZFS, so much of the groundwork > is done and needed for the fs/pool already. > > > but then your reads > > will suffer because pieces of files can potentially be spread all > > over hell and half of Georgia on the zdevs. > > I don''t know that you can make this statement without some study of an > actual implementation on real world data -- and then because it is block > based, you should see varying degrees of this dedup-flack-frag depending > on data/usage.It''s just a NonScientificWAG. I agree that most of the duplicated blocks will in most cases be part of identical files anyway, and thus lined up exactly as you''d want them. I was just free thinking and typing.> > > For instance, I would imagine that in many scenarios much od the dedup > data blocks would belong to the same or very similar files. In this case > the blocks were written as best they could on the first write, the deduped > blocks would point to a pretty sequential line o blocks. Now on some files > there may be duplicate header or similar portions of data -- these may > cause you to jump around the disk; but I do not know how much this would be > hit or impact real world usage. > > > > Deduplication is going > > to require the judicious application of hallucinogens and man hours. > > I expect that someone is up to the task. > > I would prefer the coder(s) not be seeing "pink elephants" while writing > this, but yes it can and will be done. It (I believe) will be easier > after the grow/shrink/evac code paths are in place though. Also, the > grow/shrink/evac path allows (if it is done right) for other cool things > like a base to build a roaming defrag that takes into account snaps, > clones, live and the like. I know that some feel that the grow/shrink/evac > code is more important for home users, but I think that it is super > important for most of these additional features.The elephants are just there to keep the coders company. There are tons of benefits for dedup, both for home and non-home users. I''m happy that it''s going to be done. I expect the first complaints will come from those people who don''t understand it, and their df and du numbers look different than their zpool status ones. 
Perhaps df/du will just have to be faked out for those folks, or we just apply the same hallucinogens to them instead.> > > -Wade > > > On Tue, Jul 22, 2008 at 10:39 AM, <Wade.Stuart at fallon.com> wrote: > > zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 08:05:01 AM: > > > > > > Hi All > > > >Is there any hope for deduplication on ZFS ? > > > >Mertol Ozyoney > > > >Storage Practice - Sales Manager > > > >Sun Microsystems > > > > Email mertol.ozyoney at sun.com > > > > > > There is always hope. > > > > > > Seriously thought, looking at http://en.wikipedia. > > > org/wiki/Comparison_of_revision_control_software there are a lot of > > > choices of how we could implement this. > > > > > > SVN/K , Mercurial and Sun Teamware all come to mind. Simply ;) merge > > > one of those with ZFS. > > > > > > It _could_ be as simple (with SVN as an example) of using directory > > > listings to produce files which were then ''diffed''. You could then > > > view the diffs as though they were changes made to lines of source > code. > > > > > > Just add a "tree" subroutine to allow you to grab all the diffs that > > > referenced changes to file ''xyz'' and you would have easy access to > > > all the changes of a particular file (or directory). > > > > > > With the speed optimized ability added to use ZFS snapshots with the > > > "tree subroutine" to rollback a single file (or directory) you could > > > undo / redo your way through the filesystem. > > > > > > > > dedup is not revision control, you seem to completely misunderstand the > > problem. > > > > > > > > > Using a LKCD > (http://www.faqs.org/docs/Linux-HOWTO/Linux-Crash-HOWTO.html > > > ) you could "sit out" on the play and watch from the sidelines -- > > > returning to the OS when you thought you were ''safe'' (and if not, > > > jumping backout). > > > > > > Now it seems you have veered even further off course. What are you > > implying the LKCD has to do with zfs, solaris, dedup, let alone revision > > control software? > > > > -Wade > > > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > > > > > > > > -- > > chris -at- microcozm -dot- net > > === Si Hoc Legere Scis Nimium Eruditionis Habes > > _______________________________________________ > > zfs-discuss mailing list > > zfs-discuss at opensolaris.org > > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss > >-- chris -at- microcozm -dot- net === Si Hoc Legere Scis Nimium Eruditionis Habes -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/177625a1/attachment.html>
Chris Cosby wrote:> > > On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com > <mailto:Wade.Stuart at fallon.com>> wrote: > > zfs-discuss-bounces at opensolaris.org > <mailto:zfs-discuss-bounces at opensolaris.org> wrote on 07/22/2008 > 09:58:53 AM: > > > To do dedup properly, it seems like there would have to be some > > overly complicated methodology for a sort of delayed dedup of the > > data. For speed, you''d want your writes to go straight into the > > cache and get flushed out as quickly as possibly, keep everything as > > ACID as possible. Then, a dedup scrubber would take what was > > written, do the voodoo magic of checksumming the new data, scanning > > the tree to see if there are any matches, locking the duplicates, > > run the usage counters up or down for that block of data, swapping > > out inodes, and marking the duplicate data as free space. > I agree, but what you are describing is file based dedup, ZFS > already has > the groundwork for dedup in the system (block level checksuming and > pointers). > > > It''s a > > lofty goal, but one that is doable. I guess this is only necessary > > if deduplication is done at the file level. If done at the block > > level, it could possibly be done on the fly, what with the already > > implemented checksumming at the block level, > > exactly -- that is why it is attractive for ZFS, so much of the > groundwork > is done and needed for the fs/pool already. > > > but then your reads > > will suffer because pieces of files can potentially be spread all > > over hell and half of Georgia on the zdevs. > > I don''t know that you can make this statement without some study of an > actual implementation on real world data -- and then because it is > block > based, you should see varying degrees of this dedup-flack-frag > depending > on data/usage. > > It''s just a NonScientificWAG. I agree that most of the duplicated > blocks will in most cases be part of identical files anyway, and thus > lined up exactly as you''d want them. I was just free thinking and typing. >No, you are right to be concerned over block-level dedup seriously impacting seeks. The problem is that, given many common storage scenarios, you will have not just similar files, but multiple common sections of many files. Things such as the various standard productivity app documents will not just have the same header sections, but internally, there will be significant duplications of considerable length with other documents from the same application. Your 5MB Word file is thus likely to share several (actually, many) multi-kB segments with other Word files. You will thus end up seeking all over the disk to read _most_ Word files. Which really sucks. I can list at least a couple more common scenarios where dedup has to potential to save at least some reasonable amount of space, yet will absolutely kill performance.> For instance, I would imagine that in many scenarios much od the > dedup > data blocks would belong to the same or very similar files. In > this case > the blocks were written as best they could on the first write, > the deduped > blocks would point to a pretty sequential line o blocks. Now on > some files > there may be duplicate header or similar portions of data -- these may > cause you to jump around the disk; but I do not know how much this > would be > hit or impact real world usage. > > > > Deduplication is going > > to require the judicious application of hallucinogens and man hours. > > I expect that someone is up to the task. 
> > I would prefer the coder(s) not be seeing "pink elephants" while > writing > this, but yes it can and will be done. It (I believe) will be easier > after the grow/shrink/evac code paths are in place though. Also, the > grow/shrink/evac path allows (if it is done right) for other cool > things > like a base to build a roaming defrag that takes into account snaps, > clones, live and the like. I know that some feel that the > grow/shrink/evac > code is more important for home users, but I think that it is super > important for most of these additional features. > > The elephants are just there to keep the coders company. There are > tons of benefits for dedup, both for home and non-home users. I''m > happy that it''s going to be done. I expect the first complaints will > come from those people who don''t understand it, and their df and du > numbers look different than their zpool status ones. Perhaps df/du > will just have to be faked out for those folks, or we just apply the > same hallucinogens to them instead. >I''m still not convinced that dedup is really worth it for anything but very limited, constrained usage. Disk is just so cheap, that you _really_ have to have an enormous amount of dup before the performance penalties of dedup are countered. This in many ways reminds me the last year''s discussion over file versioning in the filesystem. It sounds like a cool idea, but it''s not a generally-good idea. I tend to think that this kind of problem is better served by applications handling it, if they are concerned about it. Pretty much, here''s what I''ve heard: Dedup Advantages: (1) save space relative to the amount of duplication. this is highly dependent on workload, and ranges from 0% to 99%, but the distribution of possibilities isn''t a bell curve (i.e. the average space saved isn''t 50%). Dedup Disadvantages: (1) increase codebase complexity, in both cases of dedup during write, and ex-post-facto batched dedup (2) noticable write performance penalty (assuming block-level dedup on write), with potential write cache issues. (3) very significant post-write dedup time, at least on the order of ''zfs scrub''. Also, during such a post-write scenario, it more or less takes the zpool out of usage. (4) If dedup is done at block level, not at file level, it kills read performance, effectively turning all dedup''d files from sequential read to a random read. That is, block-level dedup drastically accelerates filesystem fragmentation. (5) Something no one has talked about, but is of concern. By removing duplication, you increase the likelihood that loss of the "master" segment will corrupt many more files. Yes, ZFS has self-healing and such. But, particularly in the case where there is no ZFS pool redundancy (or pool-level redundancy has been compromised), loss of one block can thus be many more times severe. We need to think long and hard about what the real widespread benefits are of dedup before committing to a filesystem-level solution, rather than an application-level one. In particular, we need some real-world data on the actual level of duplication under a wide variety of circumstances. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
FWIW, Sun's VTL products use ZFS and offer de-duplication services.
http://www.sun.com/aboutsun/pr/2008-04/sunflash.20080407.2.xml
 -- richard
zfs-discuss-bounces at opensolaris.org wrote on 07/22/2008 11:48:30 AM:> Chris Cosby wrote: > > > > > > On Tue, Jul 22, 2008 at 11:19 AM, <Wade.Stuart at fallon.com > > <mailto:Wade.Stuart at fallon.com>> wrote: > > > > zfs-discuss-bounces at opensolaris.org > > <mailto:zfs-discuss-bounces at opensolaris.org> wrote on 07/22/2008 > > 09:58:53 AM: > > > > > To do dedup properly, it seems like there would have to be some > > > overly complicated methodology for a sort of delayed dedup of the > > > data. For speed, you''d want your writes to go straight into the > > > cache and get flushed out as quickly as possibly, keep everythingas> > > ACID as possible. Then, a dedup scrubber would take what was > > > written, do the voodoo magic of checksumming the new data,scanning> > > the tree to see if there are any matches, locking the duplicates, > > > run the usage counters up or down for that block of data,swapping> > > out inodes, and marking the duplicate data as free space. > > I agree, but what you are describing is file based dedup, ZFS > > already has > > the groundwork for dedup in the system (block level checksuming and > > pointers). > > > > > It''s a > > > lofty goal, but one that is doable. I guess this is onlynecessary> > > if deduplication is done at the file level. If done at the block > > > level, it could possibly be done on the fly, what with thealready> > > implemented checksumming at the block level, > > > > exactly -- that is why it is attractive for ZFS, so much of the > > groundwork > > is done and needed for the fs/pool already. > > > > > but then your reads > > > will suffer because pieces of files can potentially be spread all > > > over hell and half of Georgia on the zdevs. > > > > I don''t know that you can make this statement without some study ofan> > actual implementation on real world data -- and then because it is > > block > > based, you should see varying degrees of this dedup-flack-frag > > depending > > on data/usage. > > > > It''s just a NonScientificWAG. I agree that most of the duplicated > > blocks will in most cases be part of identical files anyway, and thus > > lined up exactly as you''d want them. I was just free thinking andtyping.> > > No, you are right to be concerned over block-level dedup seriously > impacting seeks. The problem is that, given many common storage > scenarios, you will have not just similar files, but multiple common > sections of many files. Things such as the various standard > productivity app documents will not just have the same header sections, > but internally, there will be significant duplications of considerable > length with other documents from the same application. Your 5MB Word > file is thus likely to share several (actually, many) multi-kB segments > with other Word files. You will thus end up seeking all over the disk > to read _most_ Word files. Which really sucks. I can list at least a > couple more common scenarios where dedup has to potential to save at > least some reasonable amount of space, yet will absolutely killperformance. While you may have a point on some data sets, actual testing of this type of data (28.000+ of actual end user doc files) using xdelta with 4k and 8k block sizes shows that the similar blocks in these files are in the 2% range (~ 6% for 4k). That means a full read of each file on average would require < 6% seeks to other disk areas. 
That is not bad, but this is the worst case picture as those duplicate blocks would need to live in the same offsets and have the same block boundaries to "match" under the proposed algo. To me this means word docs are not a good candidate for dedup at the block level -- but the actual cost to dedup anyways seems small. Of course you could come up with data that is pathologically bad for these benchmarks, but I do not believe it would be nearly as bad as you are making it out to be on real world data.> > > > For instance, I would imagine that in many scenarios much od the > > dedup > > data blocks would belong to the same or very similar files. In > > this case > > the blocks were written as best they could on the first write, > > the deduped > > blocks would point to a pretty sequential line o blocks. Now on > > some files > > there may be duplicate header or similar portions of data -- thesemay> > cause you to jump around the disk; but I do not know how much this > > would be > > hit or impact real world usage. > > > > > > > Deduplication is going > > > to require the judicious application of hallucinogens and manhours.> > > I expect that someone is up to the task. > > > > I would prefer the coder(s) not be seeing "pink elephants" while > > writing > > this, but yes it can and will be done. It (I believe) will beeasier> > after the grow/shrink/evac code paths are in place though. Also,the> > grow/shrink/evac path allows (if it is done right) for other cool > > things > > like a base to build a roaming defrag that takes into accountsnaps,> > clones, live and the like. I know that some feel that the > > grow/shrink/evac > > code is more important for home users, but I think that it issuper> > important for most of these additional features. > > > > The elephants are just there to keep the coders company. There are > > tons of benefits for dedup, both for home and non-home users. I''m > > happy that it''s going to be done. I expect the first complaints will > > come from those people who don''t understand it, and their df and du > > numbers look different than their zpool status ones. Perhaps df/du > > will just have to be faked out for those folks, or we just apply the > > same hallucinogens to them instead. > > > I''m still not convinced that dedup is really worth it for anything but > very limited, constrained usage. Disk is just so cheap, that you > _really_ have to have an enormous amount of dup before the performance > penalties of dedup are countered.If you can dedup 30% of your data, your disk just became 30% cheaper. Depending on workflow, the cost of disk is the barrier -- not cpu cycles or write/read speed.> > This in many ways reminds me the last year''s discussion over file > versioning in the filesystem. It sounds like a cool idea, but it''s not > a generally-good idea. I tend to think that this kind of problem is > better served by applications handling it, if they are concerned aboutit.>snapping a full filesystem for versions is expensive -- you are dealing with one file changing. doing dedup on zfs is inexpensive vs a follow the writes queue.> Pretty much, here''s what I''ve heard: > > Dedup Advantages: > > (1) save space relative to the amount of duplication. this is highly > dependent on workload, and ranges from 0% to 99%, but the distribution > of possibilities isn''t a bell curve (i.e. the average space saved isn''t > 50%). 
> > > Dedup Disadvantages: > > (1) increase codebase complexity, in both cases of dedup during write, > and ex-post-facto batched dedupyes, but the code path is optional.> > (2) noticable write performance penalty (assuming block-level dedup on > write), with potential write cache issues.there is cost, but smart use of hash lookups and caching should absorb most of these. most of the cost comes with using a better hashing algo instead of fletch2/4> > (3) very significant post-write dedup time, at least on the order of > ''zfs scrub''. Also, during such a post-write scenario, it more or less > takes the zpool out of usage.post write, while not as bad as a separate dedup app, reduces the value of tying it to zfs. it should be done inline.> > (4) If dedup is done at block level, not at file level, it kills read > performance, effectively turning all dedup''d files from sequential read > to a random read. That is, block-level dedup drastically accelerates > filesystem fragmentation.again, this is completely dependant on the implementation and data sets. looking at our real world data on a 14tb user file store shows that most dedup that would happen (using 4, 8, 16 and 128k blocks) happens on totally binary similar files, a small percentage of dedup happens on other data if a static block seek is used (no sliding delta window).> > (5) Something no one has talked about, but is of concern. By removing > duplication, you increase the likelihood that loss of the "master" > segment will corrupt many more files. Yes, ZFS has self-healing and > such. But, particularly in the case where there is no ZFS pool > redundancy (or pool-level redundancy has been compromised), loss of one > block can thus be many more times severe.I assume that no one has talked about that because it seems obvious. Your blocks become N times more "valuable" where N is the number of blocks that are pointed to that block for dedup. A lost block on zfs can therefore affect N files + X snapshots + Y clones, or the entire filesystem if it was holding one of a few zfs structures.> > > We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances.There was already a post that shows how to exploit the zfs block checksums to gather similar block stats. An issue I have with that is zfs default hashing is pretty collision prone and the data seems suspect. I can probably post the perl scripts I used to gather data on my systems. The hash lookup tables that they generate are pretty damn huge, but the reporting part could display relative info in a compact way for posting. Assumptions I made were fixed block seeks (slurping in the largest block of data each read and acting on it as all block sizes in the test phase to be efficient), md5 match = bin match (pretty safe but a real system would bit level compare on a hash match). -Wade> > -- > Erik Trimble > Java System Support > Mailstop: usca22-123 > Phone: x17195 > Santa Clara, CA > Timezone: US/Pacific (GMT-0800) > > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
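For the write-path variant mentioned in point (2) above, the extra cost per
block write is essentially one checksum lookup. A toy sketch of that table
follows; the DedupTable class and its allocate/read_block callbacks are
invented names for illustration, not anything in the ZFS code, and a real
implementation would live in the pool's write path rather than in Python.

import hashlib

class DedupTable:
    """Toy write-time dedup table: checksum -> (block_address, refcount)."""

    def __init__(self):
        self.entries = {}

    def write_block(self, data, allocate, read_block):
        """allocate(data) -> new address; read_block(addr) -> bytes."""
        csum = hashlib.sha256(data).digest()
        hit = self.entries.get(csum)
        if hit is not None:
            addr, refs = hit
            # Byte compare on a hash hit before sharing the block, so a
            # collision costs one extra read instead of corrupting data.
            if read_block(addr) == data:
                self.entries[csum] = (addr, refs + 1)
                return addr              # reference the existing block
        # No hit (or a genuine collision): fall back to a normal allocation.
        addr = allocate(data)
        self.entries[csum] = (addr, 1)
        return addr

The byte compare on a hit is the knob that trades write latency for safety
with a weaker checksum; with sha256 many designs would choose to skip it.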
On Tue, Jul 22, 2008 at 11:48 AM, Erik Trimble <Erik.Trimble at sun.com> wrote:> No, you are right to be concerned over block-level dedup seriously > impacting seeks. The problem is that, given many common storage > scenarios, you will have not just similar files, but multiple common > sections of many files. Things such as the various standard > productivity app documents will not just have the same header sections, > but internally, there will be significant duplications of considerable > length with other documents from the same application. Your 5MB Word > file is thus likely to share several (actually, many) multi-kB segments > with other Word files. You will thus end up seeking all over the disk > to read _most_ Word files. Which really sucks. I can list at least a > couple more common scenarios where dedup has to potential to save at > least some reasonable amount of space, yet will absolutely kill performance.This would actually argue in favor of dedup... If the blocks are common they are more likely to be in the ARC with dedup, thus avoiding a read altogether. There would likely be greater overhead in assembling smaller packets Here''s some real life... I have 442 Word documents created by me and others over several years. Many were created from the same corporate templates. I generated the MD5 hash of every 8 KB of each file and came up with a total of 8409 hash - implying 65 MB of word documents. Taking those hashes through "sort | uniq -c | sort -n" led to the following: 3 p9I7HgbxFme7TlPZmsD6/Q 3 sKE3RBwZt8A6uz+tAihMDA 3 uA4PK1+SQqD+h1Nv6vJ6fQ 3 wQoU2g7f+dxaBMzY5rVE5Q 3 yM0csnXKtRxjpSxg1Zma0g 3 yyokNamrTcD7lQiitcVgqA 4 jdsZZfIHtshYZiexfX3bQw 17 pohs0DWPFwF8HJ8p/HnFKw 19 s0eKyh/vT1LothTvsqtZOw 64 CCn3F0CqsauYsz6uId7hIg Note that "CCn3F0CqsauYsz6uId7hIg" is the MD5 hash of 8 KB of zeros. If compression is used as well, this block would not even be stored. If 512 byte blocks are used, the story is a bit different: 81 DEf6rofNmnr1g5f7oaV75w 109 3gP+ZaZ2XKqMkTQ6zGLP/A 121 ypk+0ryBeMVRnnjYQD2ZEA 124 HcuMdyNKV7FDYcPqvb2o3Q 371 s0eKyh/vT1LothTvsqtZOw 372 ozgGMCCoc+0/RFbFDO8MsQ 8535 v2GerAzfP2jUluqTRBN+iw As you might guess, that most common hash is a block of zeros. Most likely, however, these files will end up using 128K blocks for the first part of the file, smaller for the portions that don''t fit. When I look at just 128K... 1 znJqBX8RtPrAOV2I6b5Wew 2 6tuJccWHGVwv3v4nee6B9w 2 Qr//PMqqhMtuKfgKhUIWVA 2 idX0awfYjjFmwHwi60MAxg 2 s0eKyh/vT1LothTvsqtZOw 3 +Q/cXnknPr/uUCARsaSIGw 3 /kyIGuWnPH/dC5ETtMqqLw 3 4G/QmksvChYvfhAX+rfgzg 3 SCMoKuvPepBdQEBVrTccvA 3 vbaNWd5IQvsGdQ9R8dIqhw There is actually very little duplication in word files. Many of the dupes above are from various revisions of the same files.> Dedup Advantages: > > (1) save space relative to the amount of duplication. this is highly > dependent on workload, and ranges from 0% to 99%, but the distribution > of possibilities isn''t a bell curve (i.e. the average space saved isn''t > 50%).I have evidence that shows 75% duplicate data on (mostly sparse) zone roots created and maintained over a 18 month period. I show other evidence above that it is not nearly as good for one person''s copy of word documents. 
I suspect that it would be different if the file system that I did this on was on a file server where all of my colleagues also stored their documents (and revisions of mine that they have reviewed).> (2) noticable write performance penalty (assuming block-level dedup on > write), with potential write cache issues.Depends on the approach taken.> (3) very significant post-write dedup time, at least on the order of > ''zfs scrub''. Also, during such a post-write scenario, it more or less > takes the zpool out of usage.The ZFS competition that has this in shipping product today does not quiesce the file system during dedup passes.> (4) If dedup is done at block level, not at file level, it kills read > performance, effectively turning all dedup''d files from sequential read > to a random read. That is, block-level dedup drastically accelerates > filesystem fragmentation.Absent data that shows this, I don''t accept this claim. Arguably the blocks that are duplicate are more likely to be in cache. I think that my analysis above shows that this is not a concern for my data set.> (5) Something no one has talked about, but is of concern. By removing > duplication, you increase the likelihood that loss of the "master" > segment will corrupt many more files. Yes, ZFS has self-healing and > such. But, particularly in the case where there is no ZFS pool > redundancy (or pool-level redundancy has been compromised), loss of one > block can thus be many more times severe.I believe this is true and likely a good topic for discussion.> We need to think long and hard about what the real widespread benefits > are of dedup before committing to a filesystem-level solution, rather > than an application-level one. In particular, we need some real-world > data on the actual level of duplication under a wide variety of > circumstances.The key thing here is that distributed applications will not play nicely. In my best use case, Solaris zones and LDoms are the "application". I don''t expect or want Solaris to form some sort of P2P storage system across my data center to save a few terabytes. D12n at the storage device can do this much more reliably with less complexity. -- Mike Gerdts http://mgerdts.blogspot.com/
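For anyone who wants to repeat that kind of measurement on their own data,
something along these lines reproduces the hash-every-8-KB tally without the
sort | uniq plumbing. The paths and block size are whatever you point it at,
and MD5 is used here only for counting duplicates, as in the analysis above --
not as a proposal for what an in-kernel dedup should use.

import hashlib, os, sys
from collections import Counter

def block_hashes(path, blocksize=8192):
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(blocksize)
            if not chunk:
                break
            yield hashlib.md5(chunk).hexdigest()

def tally(root, blocksize=8192):
    counts = Counter()
    for dirpath, _, files in os.walk(root):
        for name in files:
            try:
                counts.update(block_hashes(os.path.join(dirpath, name), blocksize))
            except OSError:
                pass                     # unreadable file, skip it
    return counts

if __name__ == '__main__':
    bsize = int(sys.argv[2]) if len(sys.argv) > 2 else 8192
    counts = tally(sys.argv[1], bsize)
    for csum, n in counts.most_common(10):
        print(n, csum)
    total = sum(counts.values())
    print("blocks:", total, " unique:", len(counts),
          " duplicates a dedup pass could reclaim:", total - len(counts))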
On 7/22/08 11:48 AM, "Erik Trimble" <Erik.Trimble at Sun.COM> wrote:> I''m still not convinced that dedup is really worth it for anything but > very limited, constrained usage. Disk is just so cheap, that you > _really_ have to have an enormous amount of dup before the performance > penalties of dedup are countered.Again, I will argue that the spinning rust itself isn''t expensive, but data management is. If I am looking to protect multiple PB (through remote data replication and backup), I need more than just the rust to store that. I need to copy this data, which takes time and effort. If the system can say "these 500K blocks are the same as these 500K, don''t bother copying them to the DR site AGAIN," then I have a less daunting data management task. De-duplication makes a lot of sense at some layer(s) within the data management scheme. Charles
On Tue, 22 Jul 2008, Erik Trimble wrote:
> > Dedup Disadvantages:

Obviously you do not work in the Sun marketing department, which is
interested in this feature (due to other companies marketing it). Note
that the topic starter post came from someone in Sun's marketing
department.

I think that deduplication is a potential diversion which draws
attention away from the core ZFS things which are still not ideally
implemented or do not yet exist at all. Compared with other
filesystems, ZFS is still a toddler since it has only been deployed for
a few years. ZFS is intended to be an enterprise filesystem so let's
give it more time to mature before hitting it with the "feature" stick.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
>>>>> "et" == Erik Trimble <Erik.Trimble at Sun.COM> writes:et> Dedup Advantages: et> (1) save space (2) coalesce data which is frequently used by many nodes in a large cluster into a small nugget of common data which can fit into RAM or L2 fast disk (3) back up non-ZFS filesystems that don''t have snapshots and clones (4) make offsite replication easier on the WAN but, yeah, aside from imagining ahead to possible disastrous problems with the final implementation, the imagined use cases should probably be carefully compared to existing large installations. Firstly, dedup may be more tempting as a bulletted marketing feature or a bloggable/banterable boasting point than it is valuable to real people. Secondly, the comparison may drive the implementation. For example, should dedup happen at write time and be something that doesn''t happen to data written before it''s turned on, like recordsize or compression, to make it simpler in the user interface, and avoid problems with scrubs making pools uselessly slow? Or should it be scrub-like so that already-written filesystems can be thrown into the dedup bag and slowly squeezed, or so that dedup can run slowly during the business day over data written quickly at night (fast outside-business-hours backup)? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 304 bytes Desc: not available URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080722/d3497f8b/attachment.bin>
On Tue, 22 Jul 2008, Miles Nordin wrote:> scrubs making pools uselessly slow? Or should it be scrub-like so > that already-written filesystems can be thrown into the dedup bag and > slowly squeezed, or so that dedup can run slowly during the business > day over data written quickly at night (fast outside-business-hours > backup)?I think that the scrub-like model makes the most sense since ZFS write performance should not be penalized. It is useful to implement score-boarding so that a block is not considered for de-duplication until it has been duplicated a certain number of times. In order to decrease resource consumption, it is useful to perform de-duplication over a span of multiple days or multiple weeks doing just part of the job each time around. Deduping a petabyte of data seems quite challenging yet ZFS needs to be scalable to these levels. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
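A score-board along those lines can be a very small structure: count sightings
of each checksum during the incremental passes and only hand a block to the
expensive dedup path once it crosses a threshold. A hedged sketch follows; the
threshold, table size and eviction policy are pulled out of thin air, not from
any ZFS design.

from collections import Counter

class DedupScoreboard:
    """Promote a checksum to dedup candidacy only after N sightings."""

    def __init__(self, threshold=3, max_entries=1000000):
        self.threshold = threshold
        self.max_entries = max_entries
        self.seen = Counter()

    def record(self, checksum):
        """Call once per block visited by a scrub-like pass.
        Returns True when the block looks worth deduplicating."""
        self.seen[checksum] += 1
        if len(self.seen) > self.max_entries:
            # Crude pressure valve: forget the rarest half of the table so
            # memory stays bounded across a pass that runs for weeks.
            for csum, _ in self.seen.most_common()[self.max_entries // 2:]:
                del self.seen[csum]
        return self.seen.get(checksum, 0) >= self.threshold

Blocks that never cross the threshold never cost more than a counter entry,
which is what keeps the petabyte case tractable.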
Bob Friesenhahn wrote:> On Tue, 22 Jul 2008, Erik Trimble wrote: > >> Dedup Disadvantages: >> > > Obviously you do not work in the Sun marketing department which is > intrested in this feature (due to some other companies marketing it). > Note that the topic starter post came from someone in Sun''s marketing > department. > > I think that dedupication is a potential diversion which draws > attention away from the core ZFS things which are still not ideally > implemented or do not yet exist at all. Compared with other > filesystems, ZFS is still a toddler since it has only been deployed > for a few years. ZFS is intended to be an enterprise filesystem so > let''s give it more time to mature before hiting it with the "feature" > stick. > > Bob > =====================================> Bob Friesenhahn > bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ > GraphicsMagick Maintainer, http://www.GraphicsMagick.org/ >More than anything, Bob''s reply is my major feeling on this. Dedup may indeed turn out to be quite useful, but honestly, there''s no broad data which says that it is a Big Win (tm) _right_now_, compared to finishing other features. I''d really want a Engineering Study about the real-world use (i.e. what percentage of the userbase _could_ use such a feature, and what percentage _would_ use it, and exactly how useful would each segment find it...) before bumping it up in the priority queue of work to be done on ZFS. -- Erik Trimble Java System Support Mailstop: usca22-123 Phone: x17195 Santa Clara, CA Timezone: US/Pacific (GMT-0800)
> On Tue, 22 Jul 2008, Miles Nordin wrote: > > scrubs making pools uselessly slow? Or should it be scrub-like so > > that already-written filesystems can be thrown into the dedup bag and > > slowly squeezed, or so that dedup can run slowly during the business > > day over data written quickly at night (fast outside-business-hours > > backup)? > > I think that the scrub-like model makes the most sense since ZFS write > performance should not be penalized. It is useful to implement > score-boarding so that a block is not considered for de-duplication > until it has been duplicated a certain number of times. In order to > decrease resource consumption, it is useful to perform de-duplication > over a span of multiple days or multiple weeks doing just part of the > job each time around. Deduping a petabyte of data seems quite > challenging yet ZFS needs to be scalable to these levels. > Bob FriesenhahnIn case anyone (other than Bob) missed it, this is why I suggested "File-Level" Dedup: "... using directory listings to produce files which were then ''diffed''. You could then view the diffs as though they were changes made ..." We could have: "Block-Level" (if we wanted to restore an exact copy of the drive - duplicate the ''dd'' command) or "Byte-Level" (if we wanted to use compression - duplicate the ''zfs set compression=on rpool'' _or_ ''bzip'' commands) ... etc... assuming we wanted to duplicate commands which already implement those features, and provide more than we (the filesystem) needs at a very high cost (performance). So I agree with your comment about the need to be mindful of "resource consumption", the ability to do this over a period of days is also useful. Indeed the Plan9 filesystem simply snapshots to WORM and has no delete - nor are they able to fill their drives faster than they can afford to buy new ones: Venti Filesystem http://www.cs.bell-labs.com/who/seanq/p9trace.html Rob This message posted from opensolaris.org
> with other Word files. You will thus end up seeking all over the disk
> to read _most_ Word files. Which really sucks.
<snip>
> very limited, constrained usage. Disk is just so cheap, that you
> _really_ have to have an enormous amount of dup before the performance
> penalties of dedup are countered.

Neither of these holds true for SSDs though, does it? Seeks are
essentially free, and the devices are not cheap.

cheers,
--justin
On Tue, Jul 22, 2008 at 10:44 PM, Erik Trimble <Erik.Trimble at sun.com> wrote:> More than anything, Bob''s reply is my major feeling on this. Dedup may > indeed turn out to be quite useful, but honestly, there''s no broad data > which says that it is a Big Win (tm) _right_now_, compared to finishing > other features. I''d really want a Engineering Study about the > real-world use (i.e. what percentage of the userbase _could_ use such a > feature, and what percentage _would_ use it, and exactly how useful > would each segment find it...) before bumping it up in the priority > queue of work to be done on ZFS.I get this. However, for most of my uses of clones dedup is considered finishing the job. Without it, I run the risk of having way more writable data than I can restore. Another solution to this is to consider the output of "zfs send" to be a stable format and get integration with enterprise backup software that can perform restores in a way that maintains space efficiency. -- Mike Gerdts http://mgerdts.blogspot.com/
Just my 2c: Is it possible to do an "offline" dedup, kind of like
snapshotting?

What I mean in practice is: we make many Solaris full-root zones. They
share a lot of data as complete files, which makes it easy to save
space up front - make one zone as a template, snapshot/clone its
dataset, and make new zones from the clones.

However, as projects evolve (software gets installed, etc.) these zones
fill up with many similar files, many of which are duplicates.

It seems reasonable to have a dedup process which would create a
least-common-denominator snapshot for all the datasets involved (the
zone roots), with each dataset's current data then treated as a "clone
with modified data" of that common snapshot.

For the system (and the user) it would be perceived just the same as
today, where these datasets are "clones with modified data" of the
original template zone-root dataset. Only the "template" becomes
different...

Hope this idea makes sense, and perhaps makes its way into code
sometime :)

This message posted from opensolaris.org
zfs-discuss-bounces at opensolaris.org wrote on 08/22/2008 04:26:35 PM:> Just my 2c: Is it possible to do an "offline" dedup, kind of like > snapshotting? > > What I mean in practice, is: we make many Solaris full-root zones. > They share a lot of data as complete files. This is kind of easy to > save space - make one zone as a template, snapshot/clone its > dataset, make new zones. > > However, as projects evolve (software installed, etc.) these zones > are filled with many similar files, many of which are duplicates. > > It seems reasonable to make some dedup process which would create a > least-common-denominator snapshot for all the datasets involved > (zone roots), of which all other datasets'' current data are to be > dubbed "clones with modified data". > > For the system (and user) it should be perceived just the same as > these datasets are currently "clones with modified data" of the > original template zone-root dataset. Only the "template" becomesdifferent...> > Hope this idea makes sense, and perhaps makes its way into code sometime:)>Jim, There have been a few long threads about this in the past on this list. My take is that it is worthwhile, but should (or really needs to) wait until the resilver/resize/evac code is done and the zfs libs are stabilized and public (meaning people can actually write non throw away code against them). Some people feel that dedup is over extending the premise of the filesystem (and would unnecessarily complicate the code). Some feel that the benefits would be less than we suspect. I would expect first dedup code you see to be written by non sun people -- and if it is enticing enough to be backported to trunk (maybe). There are a bunch of need-to-haves sitting in queue that Sun needs to focus on such as real user/group quotas, disk shrink/evac, utility/toolkit for failed pool recovery (beyond skull and bones forensic tools), etc that should be way ahead of the line vs dedup. -Wade> > This message posted from opensolaris.org > _______________________________________________ > zfs-discuss mailing list > zfs-discuss at opensolaris.org > http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
Ok, thank you Nils, Wade, for the concise replies.

After much reading I agree that the features already queued for ZFS
development deserve a higher ranking on the priority list
(pool-shrinking/disk-removal and user/group quotas would be my
favourites), so the deduplication tool I'd need would probably be some
community-contributed script which runs many hash-checks over the
zone-root file systems and does what Nils described: calculate the
most-common "template" filesystem and derive the zone roots as minimal
changes to it.

Does anybody with a wider awareness know of such readily-available
scripts on some blog? :)

Does some script-usable ZFS API (if any) provide for fetching
block/file hashes (checksums) stored in the filesystem itself? In fact,
am I wrong to expect file checksums to be readily available?

This message posted from opensolaris.org
Jim Klimov wrote:> Ok, thank you Nils, Wade for the concise replies. > > After much reading I agree that the ZFS-development queued features do deserve a higher ranking on the priority list (pool-shrinking/disk-removal and user/group quotas would be my favourites), so probably the deduplication tool I''d need would, indeed, probably be some community-contributed script which does many hash-checks in zone-root file systems and does what Nils described to calculate the most-common "template" filesystem and derive zone roots as minimal changes to it. > > Does anybody with a wider awareness know of such readily-available scripts on some blog? :) > > Does some script-usable ZFS API (if any) provide for fetching block/file hashes (checksums) stored in the filesystem itself? In fact, am I wrong to expect file-checksums to be readily available? >Yes. Files are not checksummed, blocks are checksummed. -- richard
> > > > Does some script-usable ZFS API (if any) provide for fetching > block/file hashes (checksums) stored in the filesystem itself? In > fact, am I wrong to expect file-checksums to be readily available? > > > > Yes. Files are not checksummed, blocks are checksummed. > -- richardFurther, even if they were file level checksums, the default checksums in zfs are too collision prone to be used for that purpose. If I were to write such a script I would md5+sha and then bit level compare collisions to be safe. -Wade
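Lacking a public API for the stored checksums, a user-level script has to hash
the file data itself anyway. Along the lines Wade suggests -- group by size
first, then hash, then do a byte-level compare before trusting any match --
here is a rough sketch. The default /zones path is purely illustrative, and it
only reports duplicate files; turning them into clones, hard links or a common
template is a separate (and riskier) step.

import filecmp, hashlib, os, sys
from collections import defaultdict

def digest(path, algo):
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()

def duplicate_files(roots):
    by_size = defaultdict(list)
    for root in roots:
        for dirpath, _, files in os.walk(root):
            for name in files:
                p = os.path.join(dirpath, name)
                try:
                    if os.path.isfile(p) and not os.path.islink(p):
                        by_size[os.path.getsize(p)].append(p)
                except OSError:
                    pass

    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        by_hash = defaultdict(list)
        for p in paths:
            try:
                by_hash[(digest(p, 'md5'), digest(p, 'sha256'))].append(p)
            except OSError:
                pass
        for group in by_hash.values():
            # Two hash matches are strong evidence; the byte-level compare
            # makes a false merge effectively impossible.
            confirmed = [p for p in group[1:]
                         if filecmp.cmp(group[0], p, shallow=False)]
            if confirmed:
                yield [group[0]] + confirmed

if __name__ == '__main__':
    for group in duplicate_files(sys.argv[1:] or ['/zones']):
        print(' == '.join(group))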
Wade.Stuart at fallon.com wrote:>>> Does some script-usable ZFS API (if any) provide for fetching >> block/file hashes (checksums) stored in the filesystem itself? In >> fact, am I wrong to expect file-checksums to be readily available? >> Yes. Files are not checksummed, blocks are checksummed. >> -- richard > > Further, even if they were file level checksums, the default checksums in > zfs are too collision prone to be used for that purpose. If I were to > write such a script I would md5+sha and then bit level compare collisions > to be safe.zfs set checksum=sha256 Remembering that doesn''t change existing data. -- Darren J Moffat
On Tue, 26 Aug 2008, Darren J Moffat wrote:> > zfs set checksum=sha256Expect performance to really suck after setting this. Bob =====================================Bob Friesenhahn bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/ GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Bob Friesenhahn wrote:
> On Tue, 26 Aug 2008, Darren J Moffat wrote:
>>
>> zfs set checksum=sha256
>
> Expect performance to really suck after setting this.

Do you have evidence of that? What kind of workload and how did you
test it?

I've recently been benchmarking using the filebench filemicro and
filemacro workloads for ZFS Crypto, and as part of setting my baseline
I compared the default checksum (fletcher2) with sha256; I didn't see a
big enough difference to classify it as "sucks".

Here is my evidence for the filebench filemacro workload:

http://cr.opensolaris.org/~darrenm/zfs-checksum-compare.html

This was done on an X4500 running the zfs-crypto development binaries.

In the interest of "full disclosure", I have changed the sha256.c in
the ZFS source to use the default kernel one via the crypto framework
rather than a private copy. I wouldn't expect that to have too big an
impact (I will be verifying it; I just didn't have the data to hand
quickly).

-- Darren J Moffat
On Aug 26, 2008, at 9:58 AM, Darren J Moffat wrote:> > than a private copy. I wouldn''t expect that to have too big an > impact (I >On a SPARC CMT (Niagara 1+) based system wouldn''t that be likely to have a large impact? -- Keith H. Bierman khbkhb at gmail.com | AIM kbiermank 5430 Nassau Circle East | Cherry Hills Village, CO 80113 | 303-997-2749 <speaking for myself*> Copyright 2008
On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <Darren.Moffat at sun.com> wrote:> In the interest of "full disclosure" I have changed the sha256.c in the > ZFS source to use the default kernel one via the crypto framework rather > than a private copy. I wouldn''t expect that to have too big an impact (I > will be verifying it I just didn''t have the data to hand quickly).Would this also make it so that it would use hardware assisted sha256 on capable (e.g N2) platforms? Is that the same as this change from long ago? http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.html -- Mike Gerdts http://mgerdts.blogspot.com/
On Tue, 26 Aug 2008, Darren J Moffat wrote:
> Bob Friesenhahn wrote:
>> On Tue, 26 Aug 2008, Darren J Moffat wrote:
>>>
>>> zfs set checksum=sha256
>>
>> Expect performance to really suck after setting this.
>
> Do you have evidence of that? What kind of workload and how did you test it?

I did some random I/O throughput testing using iozone. While I saw
similar I/O performance degradation to what you did (similar to your
"large_db_oltp_8k_cached"), I did observe high CPU usage. The default
fletcher algorithm uses hardly any CPU.

In a dedicated file server, CPU usage is not a problem unless it slows
subsequent requests. In a desktop system or compute workstation,
filesystem CPU usage competes with application CPU usage. With Solaris
10, enabling sha256 resulted in jerky mouse and desktop application
behavior.

Bob
=====================================
Bob Friesenhahn
bfriesen at simple.dallas.tx.us, http://www.simplesystems.org/users/bfriesen/
GraphicsMagick Maintainer, http://www.GraphicsMagick.org/
Keith Bierman wrote:
>
> On Aug 26, 2008, at 9:58 AM, Darren J Moffat wrote:
>
>> than a private copy. I wouldn't expect that to have too big an impact (I
>
> On a SPARC CMT (Niagara 1+) based system wouldn't that be likely to have
> a large impact?

UltraSPARC T1 has no hardware SHA256, so I wouldn't expect any real
change from running the private software sha256 copy in ZFS versus the
software sha256 in the crypto framework. The software sha256 in the
crypto framework has very little (if any) optimization for sun4v.

An UltraSPARC T2 has on-chip SHA256, and using it via the crypto
framework should have a good impact on performance. I don't have the
data to hand at the moment.

-- Darren J Moffat
Mike Gerdts wrote:> On Tue, Aug 26, 2008 at 10:58 AM, Darren J Moffat <Darren.Moffat at sun.com> wrote: >> In the interest of "full disclosure" I have changed the sha256.c in the >> ZFS source to use the default kernel one via the crypto framework rather >> than a private copy. I wouldn''t expect that to have too big an impact (I >> will be verifying it I just didn''t have the data to hand quickly). > > Would this also make it so that it would use hardware assisted sha256 > on capable (e.g N2) platforms?Yes.> Is that the same as this change from > long ago?> http://mail.opensolaris.org/pipermail/zfs-code/2007-March/000448.htmlSlightly different implementation - in particular it doesn''t use PKCS#11 in userland only libmd. It also falls back to direct sha256 if the crypto framework call crypto_mech2id() call fails - this is needed to support ZFS boot. -- Darren J Moffat
On Tue, Aug 26, 2008 at 10:11 AM, Darren J Moffat <Darren.Moffat at sun.com>wrote:> Keith Bierman wrote: > >> >> >> >>> >> On a SPARC CMT (Niagara 1+) based system wouldn''t that be likely to have a >> large impact? >> > > UltraSPARC T1 has no hardware SHA256 so I wouldn''t expect any real change > from running the private software sha256 copy in ZFS versus the software > sha256 in the crypto framework. TheSorry for the typo (or thinko; I did know that but it''s possible that it slipped my mind in the moment). Admittedly most community members probably don''t have an N2 to play with, but it might well be available in the DC. -- Keith Bierman khbkhb at gmail.com kbiermank AIM -------------- next part -------------- An HTML attachment was scrubbed... URL: <http://mail.opensolaris.org/pipermail/zfs-discuss/attachments/20080826/18f8512c/attachment.html>