hello!

I'm thinking of using ZFS for backing up large binary data files (i.e. VMware VMs, Oracle databases) and want to rsync them at regular intervals from other systems to one central ZFS system with compression on.

I'd like to have historical versions, and thus want to take a snapshot before each backup, i.e. before each rsync run.

Now I wonder: if I have one large datafile on ZFS, take a snapshot of the ZFS filesystem holding it, and then overwrite that file with a newer version that differs only slightly, what is the real disk consumption on the ZFS side? Do I need to handle this in a special way to make it space-efficient? Do I need to use rsync --inplace?

Typically, rsync writes a complete new (temporary) file based on the existing one and on what has changed at the remote site, and then replaces the old one with the new one via delete/rename. I assume this will eat up my backup space very quickly, even when using snapshots and even if only small parts of the large file are changing.

Comments?

regards
roland
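A minimal sketch of the workflow being described, assuming a hypothetical dataset named tank/backup and a source host named srchost; the snapshot naming is only an example:

    # take a snapshot before each backup run so the previous version stays reachable
    zfs snapshot tank/backup@$(date +%Y-%m-%d)

    # --inplace rewrites changed regions of the existing destination file
    # instead of building a temporary copy and renaming it over the original
    rsync -a --inplace srchost:/data/bigfile.dat /tank/backup/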
roland wrote:
> hello!
>
> I'm thinking of using ZFS for backing up large binary data files (i.e.
> VMware VMs, Oracle databases) and want to rsync them at regular
> intervals from other systems to one central ZFS system with compression
> on.
>
> I'd like to have historical versions, and thus want to take a snapshot
> before each backup, i.e. before each rsync run.
>
> Now I wonder: if I have one large datafile on ZFS, take a snapshot of
> the ZFS filesystem holding it, and then overwrite that file with a
> newer version that differs only slightly, what is the real disk
> consumption on the ZFS side? Do I need to handle this in a special way
> to make it space-efficient? Do I need to use rsync --inplace?
>
> Typically, rsync writes a complete new (temporary) file based on the
> existing one and on what has changed at the remote site, and then
> replaces the old one with the new one via delete/rename. I assume this
> will eat up my backup space very quickly, even when using snapshots and
> even if only small parts of the large file are changing.
>
> Comments?
>
> regards
> roland

I'm pretty sure about this answer, but others should correct me if I'm wrong. :-)

Under ZFS, any equivalent to 'cp A B' takes up no extra space. The metadata is updated so that B points to the blocks in A. Should anyone begin writing to B, only the updated blocks are added on disk, with the metadata for B now containing the proper block list to be used (some from A, and the new blocks in B). So, in your case, you get maximum space efficiency, where only the new blocks are stored, and the old blocks simply are referenced.

What I'm not sure of is exactly how ZFS does this. Does the metadata for B contain an entire list of all blocks (in order) for that file? Or does each block effectively contain a pointer to the "next" (and possibly "prev") block, in effect a doubly-linked list? I'd hope for the former, since that seems most efficient.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
> So, in your case, you get maximum space efficiency, where only the new
> blocks are stored, and the old blocks simply are referenced.

So I assume that whenever some block is read from file A and written unchanged to file B, ZFS recognizes this and just creates a new reference to the block in file A? That would be great.

I shouldn't ask so much, but rather try it on my own instead :)
Erik Trimble wrote:
> roland wrote:
>> Now I wonder: if I have one large datafile on ZFS, take a snapshot of
>> the ZFS filesystem holding it, and then overwrite that file with a
>> newer version that differs only slightly, what is the real disk
>> consumption on the ZFS side? Do I need to handle this in a special way
>> to make it space-efficient? Do I need to use rsync --inplace?
>>
>> Typically, rsync writes a complete new (temporary) file based on the
>> existing one and on what has changed at the remote site, and then
>> replaces the old one with the new one via delete/rename. I assume this
>> will eat up my backup space very quickly, even when using snapshots
>> and even if only small parts of the large file are changing.

You are correct: when you write a new file, we will allocate space for that entire new file, even if some of its blocks happen to have the same content as blocks in the previous file.

This is one of the reasons that we implemented "zfs send". If only a few blocks of a large file were modified on the sending side, then only those blocks will be sent, and we will find those blocks extremely quickly (in O(modified blocks) time; using the POSIX interfaces, as rsync does, would take O(filesize) time). Of course, if the system you're backing up from is not running ZFS, this does not help you.

> Under ZFS, any equivalent to 'cp A B' takes up no extra space. The
> metadata is updated so that B points to the blocks in A. Should anyone
> begin writing to B, only the updated blocks are added on disk, with the
> metadata for B now containing the proper block list to be used (some
> from A, and the new blocks in B). So, in your case, you get maximum
> space efficiency, where only the new blocks are stored, and the old
> blocks simply are referenced.

That is not correct; what led you to believe that? With ZFS (and UFS, EXT2, WAFL, VxFS, etc.), "cp a b" will copy the contents of the file, resulting in two copies stored on disk.

--matt
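A minimal sketch of the incremental "zfs send" path Matt describes, assuming hypothetical dataset, snapshot, and host names, and a sender that is itself running ZFS:

    # only blocks that changed between the two snapshots are read and sent
    zfs snapshot tank/data@monday
    # ... the large file is modified in place ...
    zfs snapshot tank/data@tuesday
    zfs send -i tank/data@monday tank/data@tuesday | \
        ssh backuphost zfs receive tank/backup/data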
> Now I wonder: if I have one large datafile on ZFS, take a snapshot of
> the ZFS filesystem holding it, and then overwrite that file with a
> newer version that differs only slightly, what is the real disk
> consumption on the ZFS side?

If all the blocks are rewritten, then they're all new blocks as far as ZFS knows.

> Do I need to handle this in a special way to make it space-efficient?
> Do I need to use rsync --inplace?

I would certainly try that to see if it worked, and if your access can cope with files being partially edited at times.

> Typically, rsync writes a complete new (temporary) file based on the
> existing one and on what has changed at the remote site, and then
> replaces the old one with the new one via delete/rename. I assume this
> will eat up my backup space very quickly, even when using snapshots and
> even if only small parts of the large file are changing.

Yes, I think so.

I believe this is even more of a problem for a server with Windows clients (via CIFS), because many of the apps tend to rewrite the entire file on save. Network Appliance eventually added an option to their software to do additional work and save space if files are substantially similar to the last snapshot. Theirs works on file close, so it's only a CIFS option. ZFS could conceivably do the same for local access as well, but I don't think anyone's tried to work on it yet.

-- 
Darren Dunham                                           ddunham at taos.com
Senior Technical Consultant         TAOS            http://www.taos.com/
Got some Dr Pepper?                           San Francisco, CA bay area
         < This line left intentionally blank to confuse you. >
Matthew Ahrens wrote:
> Erik Trimble wrote:
>> Under ZFS, any equivalent to 'cp A B' takes up no extra space. The
>> metadata is updated so that B points to the blocks in A. Should
>> anyone begin writing to B, only the updated blocks are added on disk,
>> with the metadata for B now containing the proper block list to be
>> used (some from A, and the new blocks in B). So, in your case, you
>> get maximum space efficiency, where only the new blocks are stored,
>> and the old blocks simply are referenced.
>
> That is not correct; what led you to believe that? With ZFS (and
> UFS, EXT2, WAFL, VxFS, etc.), "cp a b" will copy the contents of the
> file, resulting in two copies stored on disk.
>
> --matt

Basically, the descriptions of copy-on-write. Or does this apply only to snapshots? My original understanding was that CoW applied whenever you were making a duplicate of an existing file.

I can understand that 'cp' might not do that, given that there would have to be some (system-call) mechanism for ZFS to recognize that we are replicating an existing file, not just creating a whole new one. Now that I think about it, I'm not sure that I can see any way to change the behavior of POSIX calls to allow for this type of mechanism. You'd effectively have to create a whole new system call with multiple file arguments. <sigh> Wishful thinking, I guess.

Now, wouldn't it be nice to have syscalls which would implement "cp" and "mv", thus abstracting it away from the userland app?

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On 6/23/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
> Basically, the descriptions of copy-on-write. Or does this apply only
> to snapshots? My original understanding was that CoW applied whenever
> you were making a duplicate of an existing file.

CoW happens all the time. If you overwrite a file, instead of writing it to the same location on disk, ZFS allocates a new block, writes to that, and then creates a new tree in parallel (all on new, previously unused blocks). Then it changes the root of the tree to point to the newly allocated blocks.

> Now that I think about it, I'm not sure that I can see any way to
> change the behavior of POSIX calls to allow for this type of mechanism.
> You'd effectively have to create a whole new system call with multiple
> file arguments. <sigh>

Files that are mostly the same, or exactly the same? If they're exactly the same, it's called a hardlink ;) If they're mostly the same, I guess you could come up with a combination of a sparse file and a symlink. But I don't think the needed functionality is used commonly enough to bother implementing it in kernel space. If you really want it in your application, do it yourself: make a little file with two filenames, and a bitmap indicating which of them each application block should come from.

> Now, wouldn't it be nice to have syscalls which would implement "cp" and
> "mv", thus abstracting it away from the userland app?

Not really. Different apps want different behavior in their copying, so you'd have to expose a whole lot of things - how much of the copy has completed? how fast is it going? - even if they never get used by the userspace app. And it duplicates functionality - you can do everything necessary in userspace with stat(), read(), write() and friends.
Will Murnane wrote:
> On 6/23/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
>> Now, wouldn't it be nice to have syscalls which would implement "cp" and
>> "mv", thus abstracting it away from the userland app?
>
> Not really. Different apps want different behavior in their copying,
> so you'd have to expose a whole lot of things - how much of the copy
> has completed? how fast is it going? - even if they never get used by
> the userspace app. And it duplicates functionality - you can do
> everything necessary in userspace with stat(), read(), write() and
> friends.

A "copyfile" primitive would be great! It would solve the problem of having all those "friends" to deal with -- stat(), extended attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc. That isn't to say that it would have to be implemented in the kernel; it could easily be a library function.

--matt
whoops - I see I have posted the same thing several times. This was due to an error message I got when posting, which made me think it didn't get through. Could some moderator please delete those duplicate posts?

Meanwhile, I did some tests and got very weird results.

First off, I tried "--inplace" to update a slightly changing large binary file - and it didn't give the result I expected. After a snapshot, rewriting that file made it take exactly twice the space, although I only changed some bytes of the source file (I used the binary editor "bed" for that). I would have expected rsync to merge only the changed bytes into the destination file, so that the ZFS snapshot would be space-efficient (because of the few write() calls). It wasn't.

This changed when I additionally added "--append" - that results in space efficiency, but it only works as desired if I append data to the end of the source file. If I change the file in the middle (i.e. only changing bytes, not inserting new ones), then rsync tells me:

WARNING: test.dat failed verification -- update retained (will try again).

I don't know what rsync wants to tell me here... It looks like this is not a real problem, but I wonder about that message.

I tested this on zfs-fuse and cannot believe that the result is due to zfs-fuse, but I will try it on Solaris to see if it behaves differently.
update on this:

I think I have been caught by an rsync trap. It seems that using rsync locally (i.e. rsync --inplace localsource localdestination) and "remotely" (i.e. rsync --inplace localsource localhost:/localdestination) are two different things, and rsync handles the writing very differently. When I use rsync through the network stack (i.e. localhost:/localdestination), it seems to work as expected.

I need some more testing to be really sure... but for now things look more promising.

roland
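This matches rsync's documented default of copying whole files when both source and destination are local paths; passing --no-whole-file should force the delta-transfer algorithm even for a local copy (worth confirming against the man page of the rsync version in use). A small sketch for verifying the space behaviour, assuming a hypothetical dataset tank/backup:

    # force the delta-transfer algorithm even for a purely local copy
    rsync --inplace --no-whole-file /src/test.dat /tank/backup/test.dat

    # check how much space the snapshots actually hold after each run
    zfs list -t snapshot -o name,used,referenced -r tank/backup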
Matthew Ahrens wrote:
> Will Murnane wrote:
>> On 6/23/07, Erik Trimble <Erik.Trimble at sun.com> wrote:
>>> Now, wouldn't it be nice to have syscalls which would implement "cp"
>>> and "mv", thus abstracting it away from the userland app?
>
>> Not really. Different apps want different behavior in their copying,
>> so you'd have to expose a whole lot of things - how much of the copy
>> has completed? how fast is it going? - even if they never get used by
>> the userspace app. And it duplicates functionality - you can do
>> everything necessary in userspace with stat(), read(), write() and
>> friends.
>
> A "copyfile" primitive would be great! It would solve the problem of
> having all those "friends" to deal with -- stat(), extended
> attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc. That isn't to
> say that it would have to be implemented in the kernel; it could
> easily be a library function.
>
> --matt

I'm with Matt. Having a "copyfile" library/sys call would be of significant advantage. In this case, we can't currently take advantage of the CoW ability of ZFS when doing 'cp A B' (as has been pointed out to me). 'cp' simply opens file A with read(), opens a new file B with write(), and then shuffles the data between the two. Now, if we had a copyfile(A,B) primitive, then the 'cp' binary would simply call this function, and, depending on the underlying FS, it would get implemented differently. In UFS, it would work as it does now. For ZFS, it would work like a snapshot, where files A and B share data blocks (at least until someone starts to update either A or B).

There are other instances where the various POSIX calls make it hard to take advantage of modern advanced filesystems while still retaining backwards compatibility.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On Sun, Jun 24, 2007 at 03:39:40PM -0700, Erik Trimble wrote:
> Matthew Ahrens wrote:
> > A "copyfile" primitive would be great! It would solve the problem of
> > having all those "friends" to deal with -- stat(), extended
> > attributes, UFS ACLs, NFSv4 ACLs, CIFS attributes, etc. That isn't to
> > say that it would have to be implemented in the kernel; it could
> > easily be a library function.
>
> I'm with Matt. Having a "copyfile" library/sys call would be of
> significant advantage. In this case, we can't currently take advantage
> of the CoW ability of ZFS when doing 'cp A B' (as has been pointed out
> to me). 'cp' simply opens file A with read(), opens a new file B with
> write(), and then shuffles the data between the two. Now, if we had a
> copyfile(A,B) primitive, then the 'cp' binary would simply call this
> function, and, depending on the underlying FS, it would get implemented
> differently. In UFS, it would work as it does now. For ZFS, it would
> work like a snapshot, where files A and B share data blocks (at least
> until someone starts to update either A or B).

Isn't this technique an instance of `deduplication', which seems to be a hot idea in storage these days? I wonder if it could be done automatically, behind the scenes, in some fashion.

-- 
-Gary Mills-    -Unix Support-    -U of M Academic Computing and Networking-
How other storage systems do it is by calculating a hash value for each file (or block), storing that value in a db, then checking every new file (or block) commit against the db for a match, and if one is found, replacing the file (or block) with a reference to the duplicate entry in the db.

The most common non-proprietary hash calculation for file-level deduplication seems to be the combination of SHA1 and MD5 together. Collisions have been shown to exist in MD5 and are theorized to exist in SHA1 by extrapolation, but the probability of collisions occurring simultaneously in both is as "small" as the capacity of ZFS is "large" :)

While computationally intense, this would be a VERY welcome feature addition to ZFS, and given the existing infrastructure within the filesystem already, while non-trivial by any means, it seems a prime candidate. I am not a programmer, so I do not have the expertise to spearhead such a movement, but I would think getting at least a placeholder "Goals and Objectives" page into the OZFS community pages would be a good start, even if movement on this doesn't come for a year or more.

Thoughts?

-=dave

----- Original Message ----- 
From: "Gary Mills" <mills at cc.umanitoba.ca>
To: "Erik Trimble" <Erik.Trimble at Sun.COM>
Cc: "Matthew Ahrens" <Matthew.Ahrens at Sun.COM>; "roland" <devzero at web.de>; <zfs-discuss at opensolaris.org>
Sent: Sunday, June 24, 2007 3:58 PM
Subject: Re: [zfs-discuss] zfs space efficiency

> Isn't this technique an instance of `deduplication', which seems to be
> a hot idea in storage these days? I wonder if it could be done
> automatically, behind the scenes, in some fashion.
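A minimal illustration of the hash-index idea at the file level, assuming GNU coreutils' sha256sum is available and using a hypothetical /tank/backup path; a real implementation would work on blocks inside the filesystem, and (as discussed below) matching hashes would still need a bitwise verification before anything is actually shared:

    # index every file by its SHA-256 hash and flag hash collisions
    # as candidate duplicates (paths with spaces ignored for brevity)
    find /tank/backup -type f -exec sha256sum {} + | sort | \
        awk 'seen[$1]++ { print "candidate duplicate:", $2 }'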
The interesting collision is going to be filesystem-level encryption vs. deduplication, as the former makes the latter pretty difficult.

dave johnson wrote:
> How other storage systems do it is by calculating a hash value for each
> file (or block), storing that value in a db, then checking every new
> file (or block) commit against the db for a match, and if one is found,
> replacing the file (or block) with a reference to the duplicate entry
> in the db.
>
> The most common non-proprietary hash calculation for file-level
> deduplication seems to be the combination of SHA1 and MD5 together.
> Collisions have been shown to exist in MD5 and are theorized to exist
> in SHA1 by extrapolation, but the probability of collisions occurring
> simultaneously in both is as "small" as the capacity of ZFS is "large" :)
>
> While computationally intense, this would be a VERY welcome feature
> addition to ZFS, and given the existing infrastructure within the
> filesystem already, while non-trivial by any means, it seems a prime
> candidate.
On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calculation for file-level
> deduplication seems to be the combination of SHA1 and MD5 together.
> Collisions have been shown to exist in MD5 and are theorized to exist
> in SHA1 by extrapolation, but the probability of collisions occurring
> simultaneously in both is as "small" as the capacity of ZFS is "large" :)

No. Collisions in *any* hash function with output smaller than input are known to exist through information theory. The tricky part is finding the collisions without needing to resort to brute-force search.

Last I checked, the cryptographers specializing in hash functions are much less optimistic than this. I wouldn't de-duplicate without actually verifying that two blocks were actually bitwise identical.
> I wouldn't de-duplicate without actually verifying that two blocks were
> actually bitwise identical.

Absolutely not, indeed.

But the nice property of hashes is that if the hashes don't match, then the inputs do not match either.

I.e., the likelihood of having to do a full bitwise compare is vanishingly small; the likelihood of it returning "equal" is high.

Casper
[This is version 2; the first one escaped early by mistake.]

On Sun, 2007-06-24 at 16:58 -0700, dave johnson wrote:
> The most common non-proprietary hash calculation for file-level
> deduplication seems to be the combination of SHA1 and MD5 together.
> Collisions have been shown to exist in MD5 and are theorized to exist
> in SHA1 by extrapolation, but the probability of collisions occurring
> simultaneously in both is as "small" as the capacity of ZFS is "large" :)

No. Collisions in *any* hash function with output smaller than input are known to exist through information theory (you can't put kilobytes of information into a 128- or 160-bit bucket). The tricky part lies in finding collisions faster than a brute-force search would find them.

Last I checked, the cryptographers specializing in hash functions were pessimistic; the breakthroughs in collision-finding by Wang & crew a couple of years ago had revealed how little the experts actually knew about building collision-resistant hash functions. The advice to those of us who have come to rely on that hash function property was to migrate now to sha256/sha512 (notably, ZFS uses sha256, not sha1), and then migrate again once the cryptographers felt they had a better grip on the problem; the fear was that the newly discovered attacks would generalize to sha256.

But there's another way: design the system so correct behavior doesn't rely on collisions being impossible to find.

I wouldn't de-duplicate without actually verifying that two blocks or files were actually bitwise identical; if you do this, the collision-resistance of the hash function becomes far less important to correctness.

- Bill
Bill Sommerfeld wrote:
> No. Collisions in *any* hash function with output smaller than input
> are known to exist through information theory (you can't put kilobytes
> of information into a 128- or 160-bit bucket). The tricky part lies in
> finding collisions faster than a brute-force search would find them.
>
> Last I checked, the cryptographers specializing in hash functions were
> pessimistic; the breakthroughs in collision-finding by Wang & crew a
> couple of years ago had revealed how little the experts actually knew
> about building collision-resistant hash functions. The advice to those
> of us who have come to rely on that hash function property was to
> migrate now to sha256/sha512 (notably, ZFS uses sha256, not sha1), and
> then migrate again once the cryptographers felt they had a better grip
> on the problem; the fear was that the newly discovered attacks would
> generalize to sha256.
>
> But there's another way: design the system so correct behavior doesn't
> rely on collisions being impossible to find.
>
> I wouldn't de-duplicate without actually verifying that two blocks or
> files were actually bitwise identical; if you do this, the
> collision-resistance of the hash function becomes far less important to
> correctness.
>
> - Bill

I'm assuming the de-duplication scheme would be run in a similar manner to how scrub currently runs under ZFS. That is, infrequently, batched, and interruptible. :-)

Long before we look at deduplication, I'd vote for optimizing the low-hanging fruit of the case where we KNOW that two files are identical (i.e. on copying).

Oh, and last I looked, there was no consensus on how much overlap there is between the inputs that cause collisions under MD5 and those that cause collisions under SHA1. That is, there is no confidence that datasets which collide under MD5 will not also collide under SHA. They might, they might not, but it's kind of like the P vs. NP problem right now (as to determining the scope of the overlap). So don't make any assumptions about the validity of using two different checksum algorithms. I think (as Casper said) that, should you need to, you use SHA to weed out the cases where the checksums are different (since that definitively indicates the data is different), then do a bitwise compare on any that produce the same checksum, to see if they really are the same file.

-- 
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
On June 25, 2007 1:02:38 PM -0700 Erik Trimble <Erik.Trimble at Sun.COM> wrote:
> I think (as Casper said) that, should you need to, you use SHA to weed
> out the cases where the checksums are different (since that
> definitively indicates the data is different), then do a bitwise
> compare on any that produce the same checksum, to see if they really
> are the same file.

Additionally, when you start the comparison, you are likely to find out *right away* (within the first few bytes of the first block) that the files are different.

-frank
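A minimal sketch of the weed-out-then-verify approach being described, with hypothetical file names and assuming sha256sum is available; cmp stops at the first differing byte, which is why the verification is usually cheap when the data really does differ:

    # cheap first pass: different hashes prove the files differ
    a=$(sha256sum fileA | cut -d' ' -f1)
    b=$(sha256sum fileB | cut -d' ' -f1)

    if [ "$a" = "$b" ]; then
        # hashes match: verify bitwise before treating the files as duplicates;
        # cmp -s exits 0 only if the contents are identical
        cmp -s fileA fileB && echo "bitwise identical, safe to deduplicate"
    fi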
2007/6/25, Casper.Dik at sun.com <Casper.Dik at sun.com>:
>> I wouldn't de-duplicate without actually verifying that two blocks were
>> actually bitwise identical.
>
> Absolutely not, indeed.
>
> But the nice property of hashes is that if the hashes don't match, then
> the inputs do not match either.
>
> I.e., the likelihood of having to do a full bitwise compare is
> vanishingly small; the likelihood of it returning "equal" is high.

For this application (deduplicating data) the likelihood of matching hashes is very high. In fact it has to be, otherwise there would not be any data to deduplicate.

In the cp example, all writes would have matching hashes and all would need a verify.
Mattias Pantzare wrote:
> For this application (deduplicating data) the likelihood of matching
> hashes is very high. In fact it has to be, otherwise there would not
> be any data to deduplicate.
>
> In the cp example, all writes would have matching hashes and all would
> need a verify.

Would the read for verifying a matching hash take much longer than writing duplicate data? Wouldn't the significant overhead be in managing hashes and searching for matches, not in verifying matches?

However, this could be optimized for the cp example by keeping a cache of the hashes of data that was recently read, or even caching the data itself so that verification requires no duplicate read. A disk-wide search for independent duplicate data would be a different process, though.

Cheers,
11011011
zfs-discuss-bounces at opensolaris.org wrote on 07/02/2007 02:25:50 PM:

> Would the read for verifying a matching hash take much longer than
> writing duplicate data? Wouldn't the significant overhead be in
> managing hashes and searching for matches, not in verifying matches?
> However, this could be optimized for the cp example by keeping a cache
> of the hashes of data that was recently read, or even caching the data
> itself so that verification requires no duplicate read. A disk-wide
> search for independent duplicate data would be a different process,
> though.

David,

Yes -- if the hash data were attached to the blocks directly and nowhere else, you would in effect have to walk the full block tree to find a hash. I think a logical approach would be to extend ZFS's tree structure with a lookup tree mapping hash values <-> blocks. You would of course still need to add fields to the blocks, such as:

* hash(es) -- you may need to rebuild the hash table from scratch, no?
* hash_collision_version -- if you do get a hash collision on unique data, make a sub-select to specify the right block for the hash (think perl hash collision subs).
* ref count -- how many stubs are pointing to this block as the backing store? If zero, it is safe to delete.
* maybe others, to optimize frequently run operations.

Also, you would need to add some structures to the stubs to allow block pointing, etc.
Agreed. While a bitwise check is the only assured way to determine whether two blocks are duplicates, if the check were done in a streaming method as you suggest, the performance, while a big hit compared to not checking, would be more than bearable in an environment with large known levels of duplicative data, such as a large central backup "zfs send" target. The checksum metadata is sent first, then the data; the receiving system checks its db for a possible dupe and, if one is found, reads the data from local disks and compares it to the data as it comes from the sender. If it gets to the end and hasn't found a difference, it updates the pointer for the block to point to the duplicate. This won't save any bandwidth during the backup, but it will save on-disk space, and given the application, that could be very advantageous.

Thank you for the insightful discussion on this. Within the electronic discovery and records and information management space, data deduplication and policy-based aging are the foremost topics of the day, but that is at the file level, whereas block-level deduplication would lend no benefit to that regardless.

-=dave
One other thing... the checksums for all files to send *could* be checked first in batch, and known unique blocks prioritized and sent first, then the possibly duplicative data sent afterwards to be verified as dupes, thereby decreasing the possible data loss for the backup window to levels equivalently low to the checksum collision probability.

-=dave
Richard.Elling at Sun.COM wrote:
> -=dave wrote:
>> One other thing... the checksums for all files to send *could* be
>> checked first in batch, and known unique blocks prioritized and sent
>> first, then the possibly duplicative data sent afterwards to be
>> verified as dupes, thereby decreasing the possible data loss for the
>> backup window to levels equivalently low to the checksum collision
>> probability.
>
> ZFS doesn't checksum files. It checksums blocks. There has been
> occasional discussion on the problems with checksumming at the file
> level, if you check the archives.

That was a mistake due to my haste; it was meant to be "...blocks to send..." as used correctly later in the same sentence.

> I'm not very familiar with what vendors are claiming for de-duplication.
> Are most implementations at the file level?

Yes, most implementations I've evaluated are at the file level, although some products use block-level, most notably the VTL backup appliance vendors. I would suspect block-level would yield a marginal percentage increase in dedupe vs. file level, due solely to the high occurrence rate of the 0x00-filled block within many files, as I don't think ZFS treats zero-filled blocks as automatically sparse?

As for their sales pitch "deduplication" claims, they are quite outlandish, on the order of 15:1. Of course that is based on the best-case scenario, such as all files from 2000 machines backed up to a single appliance nightly, but again, such a scenario *could* yield such huge space savings. I know that Sun axed its OS VTL project, but someone else may pick up the torch. There is an interesting fledgling VTL offering for Linux at http://markh794.googlepages.com but then I digress...

I looked through the archives and found a similar discussion to this thread, http://www.opensolaris.org/jive/thread.jspa?messageID=84033 , with good (and even some duplicative ;) implementation logistics discussion. A question of Torrey McMahon that went unanswered in that thread was: is Honeycomb doing anything in this space?

-=dave