Hello everybody,

is there a way to quickly compute some hash of a file on ZFS?

As I understand it, everything is signed in the filesystem, so I'm wondering if I can avoid reading whole files with md5sum just to get a unique hash. Seems very redundant to me :)

Bottom line is I want to do manual deduplication at the file level.

Thanks in advance,
Bertrand
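(For illustration, a minimal sketch, not from the thread, of the brute-force version of what Bertrand describes: hash every file in full with MD5 and group by digest. It reads every file end to end, which is exactly the cost he hopes to avoid; the chunk size and command-line usage are illustrative assumptions.)

#!/usr/bin/env python3
# Sketch only: brute-force file-level dedup by hashing whole files.
import hashlib
import os
import sys
from collections import defaultdict

def file_digest(path, chunk=1 << 20):
    """Hash a whole file in 1 MiB chunks."""
    h = hashlib.md5()  # md5 as in the original post; swap for sha256 if preferred
    with open(path, 'rb') as f:
        for buf in iter(lambda: f.read(chunk), b''):
            h.update(buf)
    return h.hexdigest()

def find_duplicates(root):
    """Group regular files under root by full-content digest."""
    groups = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                groups[file_digest(path)].append(path)
    return {d: ps for d, ps in groups.items() if len(ps) > 1}

if __name__ == '__main__':
    for digest, paths in find_duplicates(sys.argv[1]).items():
        print(digest, *paths, sep='\n  ')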
On 11/05/2010 11:14, Bertrand Augereau wrote:
> Hello everybody,
>
> is there a way to quickly compute some hash of a file on ZFS?
> As I understand it, everything is signed in the filesystem, so I'm wondering if I can avoid reading whole files with md5sum just to get a unique hash. Seems very redundant to me :)

Signing implies the use of a key, which ZFS does not use for its block-based checksums.

There is no "quick" way to do this just now, because ZFS checksums are block-based, not whole-file-based.

--
Darren J Moffat
Is there an O(nb_blocks_for_the_file) solution, then?

I know O(nb_blocks_for_the_file) == O(nb_bytes_in_the_file), from Mr. Landau's POV, but I'm quite interested in a good constant factor.
On Tue, May 11, 2010 at 04:15:24AM -0700, Bertrand Augereau wrote:
> Is there an O(nb_blocks_for_the_file) solution, then?
>
> I know O(nb_blocks_for_the_file) == O(nb_bytes_in_the_file), from Mr. Landau's POV, but I'm quite interested in a good constant factor.

If you were considering the hashes of each ZFS block as a precomputed value, it might be tempting to think of fetching all of them and hashing them together. You would thereby avoid reading the file data; the file metadata containing the hashes is something you'd have needed to read anyway. This seems appealing, eliminating seeks and CPU work.

However, there are some issues that make the approach basically infeasible, and unreliable for comparing the results from two otherwise identical files.

First, you're assuming there's an easy interface to get the stored hashes of a block, which there isn't. Even if we ignore that for a moment, the hashes ZFS records depend on factors other than just the file content, including the way the file has been written over time. The blocks of the file may not be a constant size; a file that grew slowly may have different hashes from a copy of it, or from one extracted from an archive in a fast stream. Filesystem properties, including checksum (obviously), dedup (which implies checksum), compression (which changes the written data and can create holes), blocksize and maybe others, may differ between filesystems or even change over the time a file has been written, again changing the results and defeating comparisons. These things can defeat ZFS's dedup too, even though it does have access to the block-level checksums.

If you're going to do application-level dedup, you want to keep the advantage of being independent of these things - or even of the underlying filesystem entirely (e.g. dedup between two NAS shares).

Something similar would be useful, and much more readily achievable, from ZFS for such an application, and many others. Rather than a way to compare two files reliably for identity, I'd like a way to compare the identity of a single file between two points in time. If my application can tell quickly that the file content is unaltered since the last time I saw the file, I can avoid rehashing the content and use a stored value. If I can achieve this result for a whole directory tree, even better.

-- Dan.
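(A minimal userland sketch of the change-detection idea Dan describes: reuse a stored digest when the file appears unchanged. It stands in for the missing ZFS interface by keying a cache on stat() metadata; the cache file location and the choice of fields (size, mtime, ctime, inode) are assumptions, and are weaker than a filesystem-provided content checksum since timestamps can repeat or be manipulated.)

#!/usr/bin/env python3
# Sketch only: skip rehashing a file when stat() metadata says it is unchanged.
import hashlib
import json
import os
import sys

CACHE_PATH = os.path.expanduser('~/.hashcache.json')  # illustrative location

def _key(st):
    # Fields assumed sufficient to detect a change between two points in time.
    return [st.st_size, st.st_mtime_ns, st.st_ctime_ns, st.st_ino]

def cached_sha256(path, cache):
    st = os.stat(path)
    entry = cache.get(path)
    if entry and entry['key'] == _key(st):
        return entry['sha256']  # unchanged since last run: reuse stored value
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for buf in iter(lambda: f.read(1 << 20), b''):
            h.update(buf)
    cache[path] = {'key': _key(st), 'sha256': h.hexdigest()}
    return cache[path]['sha256']

def load_cache():
    try:
        with open(CACHE_PATH) as f:
            return json.load(f)
    except (OSError, ValueError):
        return {}

def save_cache(cache):
    with open(CACHE_PATH, 'w') as f:
        json.dump(cache, f)

if __name__ == '__main__':
    cache = load_cache()
    for p in sys.argv[1:]:
        print(cached_sha256(p, cache), p)
    save_cache(cache)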
Thanks Dan, this is exactly what I had in mind (hashing the block checksums). You convinced me to do it independently of ZFS.
Daniel Carosone wrote:
> Something similar would be useful, and much more readily achievable,
> from ZFS for such an application, and many others. Rather than a way
> to compare two files reliably for identity, I'd like a way to
> compare the identity of a single file between two points in time. If my
> application can tell quickly that the file content is unaltered since
> the last time I saw the file, I can avoid rehashing the content and use a
> stored value. If I can achieve this result for a whole directory
> tree, even better.

This would be great for any kind of archiving software. Aren't ZFS checksums already in a position to solve this? If a file changes, its dnode's checksum changes, then the checksum of the directory it is in, and so forth all the way up to the uberblock.

There may be ways a checksum changes without a real change in the file's content, but the other way round should hold: if the checksum didn't change, the file didn't change.

So the only missing link is a way to determine ZFS's checksum for a file/directory/dataset. Am I missing something here? Of course atime updates should be turned off, otherwise the checksum will get changed by the archiving agent itself.
On Tue, Jul 06, 2010 at 05:29:54PM +0200, Arne Jansen wrote:
> Daniel Carosone wrote:
> > Something similar would be useful, and much more readily achievable,
> > from ZFS for such an application, and many others. Rather than a way
> > to compare two files reliably for identity, I'd like a way to
> > compare the identity of a single file between two points in time. If my
> > application can tell quickly that the file content is unaltered since
> > the last time I saw the file, I can avoid rehashing the content and use a
> > stored value. If I can achieve this result for a whole directory
> > tree, even better.
>
> This would be great for any kind of archiving software. Aren't ZFS checksums
> already in a position to solve this? If a file changes, its dnode's checksum
> changes, then the checksum of the directory it is in, and so forth all the way
> up to the uberblock.

Not quite. The merkle tree of file and metadata blocks gets to the root via a path that is largely separate from the tree of directory objects that name them. Changing the inode(-equivalent) metadata doesn't change the contents of any of the directories that have entries pointing to that file.

Put another way, we don't have named file versions - a directory entry refers to an object and all future versions of that object. Future versions will be found at different disk addresses, but are the same object id. There's no need to rewrite the directory object unless the name changes or the link is removed.

We do have named filesystem versions (snapshots), which name the entire tree of data and metadata objects, once they finally converge.

> There may be ways a checksum changes without a real change in the file's content,
> but the other way round should hold: if the checksum didn't change, the file
> didn't change.

That's true, for individual files. (Checksum changes can happen when the checksum algorithm is changed and the same data is rewritten, or if zero-filled blocks are replaced/rewritten with holes or vice versa, and in some other cases.)

> So the only missing link is a way to determine ZFS's checksum for a
> file/directory/dataset.

That is missing, yes. Perhaps in part because no one could see a valid application-level use for this implementation-specific metadata. The original purpose of my post above was to illustrate a use case for it. The throwaway line at the end, about a directory tree, was more in the vein of wishful thinking. :)

> Am I missing something here? Of course atime updates
> should be turned off, otherwise the checksum will get changed by the archiving
> agent itself.

Separate checksums exist that cover the difference between content and metadata changes, so at least in theory an interface that exposes enough detail could avoid this restriction.

-- Dan.
On Tue, Jul 6, 2010 at 10:29 AM, Arne Jansen <sensille at gmx.net> wrote:
> Daniel Carosone wrote:
>> Something similar would be useful, and much more readily achievable,
>> from ZFS for such an application, and many others. Rather than a way
>> to compare two files reliably for identity, I'd like a way to
>> compare the identity of a single file between two points in time. If my
>> application can tell quickly that the file content is unaltered since
>> the last time I saw the file, I can avoid rehashing the content and use a
>> stored value. If I can achieve this result for a whole directory
>> tree, even better.
>
> This would be great for any kind of archiving software. Aren't ZFS checksums
> already in a position to solve this? If a file changes, its dnode's checksum
> changes, then the checksum of the directory it is in, and so forth all the way
> up to the uberblock.
> There may be ways a checksum changes without a real change in the file's content,
> but the other way round should hold: if the checksum didn't change, the file
> didn't change.
> So the only missing link is a way to determine ZFS's checksum for a
> file/directory/dataset. Am I missing something here? Of course atime updates
> should be turned off, otherwise the checksum will get changed by the archiving
> agent itself.

What is the likelihood that the same data is re-written to the file? If that is unlikely, it looks as though znode_t's z_seq may be useful. While it isn't a checksum, it seems to be incremented on every file change.

--
Mike Gerdts
http://mgerdts.blogspot.com/
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Bertrand Augereau
>
> is there a way to quickly compute some hash of a file on ZFS?
> As I understand it, everything is signed in the filesystem, so I'm
> wondering if I can avoid reading whole files with md5sum just to get a
> unique hash. Seems very redundant to me :)

If I understand right:

Although ZFS is calculating hashes of blocks, they don't correlate to hashes of files, for several reasons:

Block boundaries are not well aligned with file boundaries. A single block might encapsulate several small files, or a file might start in the middle of a block, span several more, and end in the middle of another block.

Blocks also contain non-file information.

Hashing blocks is even less relevant to file hashes if you have compression enabled, because I think ZFS hashes the compressed data, not the uncompressed data.

If you want to create file hashes out of block hashes, it's even more convoluted, because you can't generally compute hash(A+B) from hash(A) and hash(B). Although perhaps you can for some algorithms.

My advice would be:

Computing hashes is not very expensive, as long as you're only computing hashes for data that you were going to handle for other reasons anyway. Specifically, I benchmarked several hash algorithms a while back, and found ... I forget which ... either adler32 or crc is almost zero-time to compute ... that is ... the CPU was very lightly utilized while hashing blocks at maximum disk speed.

The weakness of adler32 and crc is that they're not cryptographic hashes. If a malicious person wants to corrupt a data stream while preserving the hash, it's not difficult to do. adler32 and crc are good as long as you can safely assume no malice.

md5 is significantly slower (but surprisingly not much slower) and it's a cryptographic hash. Probably not necessary for your needs.

And one more thing: no matter how strong your hash is, unless your hash is just as big as your file, collisions happen. Don't assume data is the same just because the hash is the same, if you care about your data. Always byte-level verify every block or file whose hash matches some other hash.
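(For reference, a rough sketch of this kind of benchmark, assuming 1 MiB reads and a single test file given on the command line; the adapter class and the particular set of algorithms are my choices, not from the post. Each pass rereads the file, so the first algorithm pays any cold-cache cost.)

#!/usr/bin/env python3
# Sketch only: compare cheap checksums against cryptographic hashes on one file.
import hashlib
import sys
import time
import zlib

class RollingChecksum:
    """Adapter so zlib.adler32/crc32 look like hashlib objects (update method)."""
    def __init__(self, func, start):
        self.func, self.value = func, start
    def update(self, data):
        self.value = self.func(data, self.value)

def time_hash(path, update, chunk=1 << 20):
    """Time one full pass over the file, feeding each chunk to update()."""
    start = time.perf_counter()
    with open(path, 'rb') as f:
        for buf in iter(lambda: f.read(chunk), b''):
            update(buf)
    return time.perf_counter() - start

if __name__ == '__main__':
    path = sys.argv[1]
    hashers = {
        'adler32': RollingChecksum(zlib.adler32, 1),
        'crc32':   RollingChecksum(zlib.crc32, 0),
        'md5':     hashlib.md5(),
        'sha256':  hashlib.sha256(),
    }
    for name, h in hashers.items():
        print(f'{name:8s} {time_hash(path, h.update):.3f}s')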
On Thu, 2010-07-08 at 18:46 -0400, Edward Ned Harvey wrote:
> > From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> > bounces at opensolaris.org] On Behalf Of Bertrand Augereau
> >
> > is there a way to quickly compute some hash of a file on ZFS?
> > As I understand it, everything is signed in the filesystem, so I'm
> > wondering if I can avoid reading whole files with md5sum just to get a
> > unique hash. Seems very redundant to me :)
>
> If I understand right:
>
> Although ZFS is calculating hashes of blocks, they don't correlate to hashes
> of files, for several reasons:
>
> Block boundaries are not well aligned with file boundaries. A single block
> might encapsulate several small files, or a file might start in the middle
> of a block, span several more, and end in the middle of another block.
>
> Blocks also contain non-file information.
>
> Hashing blocks is even less relevant to file hashes if you have
> compression enabled, because I think ZFS hashes the compressed data, not the
> uncompressed data.
>
> If you want to create file hashes out of block hashes, it's even more
> convoluted, because you can't generally compute hash(A+B) from hash(A)
> and hash(B). Although perhaps you can for some algorithms.
>
> My advice would be:
>
> Computing hashes is not very expensive, as long as you're only computing
> hashes for data that you were going to handle for other reasons anyway.
> Specifically, I benchmarked several hash algorithms a while back, and found
> ... I forget which ... either adler32 or crc is almost zero-time to compute
> ... that is ... the CPU was very lightly utilized while hashing blocks at
> maximum disk speed.
>
> The weakness of adler32 and crc is that they're not cryptographic hashes.
> If a malicious person wants to corrupt a data stream while preserving the
> hash, it's not difficult to do. adler32 and crc are good as long as you can
> safely assume no malice.
>
> md5 is significantly slower (but surprisingly not much slower) and it's a
> cryptographic hash. Probably not necessary for your needs.
>
> And one more thing: no matter how strong your hash is, unless your hash is
> just as big as your file, collisions happen. Don't assume data is the same
> just because the hash is the same, if you care about your data. Always
> byte-level verify every block or file whose hash matches some other hash.

MD5 hashing is not recommended for "cryptographically strong" hashing anymore. SHA-256 is the current recommendation I would make (the state of the art changes over time).

The caution about collisions happening is relevant, but with a suitably strong hash the risk is close enough to zero that normal people don't care. By that, I mean that the chance of a collision with a 256-bit hash is something like 1/2^255. You're probably more likely to spontaneously combust (by an order of magnitude) than you are to have two files that "accidentally" (or even maliciously) reduce to the same hash.

When the probability of the Sun going supernova in the next 30 seconds exceeds the probability of a cryptographic hash collision, I don't worry about the collision anymore. :-)

- Garrett
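(To put numbers on Garrett's point, a small worked example; the file counts below are arbitrary illustrations, not from the thread. The birthday bound says the probability of any accidental collision among n random 256-bit digests is roughly n^2 / 2^257.)

#!/usr/bin/env python3
# Back-of-envelope birthday bound for accidental SHA-256 collisions.
from math import log2

def collision_probability_log2(n_items, hash_bits=256):
    """Approximate log2 of P(at least one collision) ~= n^2 / 2^(bits+1)."""
    return 2 * log2(n_items) - (hash_bits + 1)

for n in (10**6, 10**9, 10**12):  # a million, a billion, a trillion files
    print(f'{n:>16,d} files: P(collision) ~ 2^{collision_probability_log2(n):.0f}')

Even at a trillion hashed files, the bound comes out around 2^-177, which is the sense in which accidental collisions are not worth worrying about.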
On 2010-Jul-09 06:46:54 +0800, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> md5 is significantly slower (but surprisingly not much slower) and it's a
> cryptographic hash. Probably not necessary for your needs.

As someone else has pointed out, MD5 is no longer considered secure (neither is SHA-1). If you want cryptographic hashing, you should probably use SHA-256 for now and be prepared to migrate to SHA-3 once it is announced. Unfortunately, SHA-256 is significantly slower than MD5 (about 4 times on a P-4, about 3 times on a SPARC-IV), and no cryptographic hash is amenable to multi-threading. The new crypto instructions on some of Intel's recent offerings may help performance (and it's likely that they will help more with SHA-3).

> And one more thing: no matter how strong your hash is, unless your hash is
> just as big as your file, collisions happen. Don't assume data is the same
> just because the hash is the same, if you care about your data. Always
> byte-level verify every block or file whose hash matches some other hash.

In theory, collisions happen. In practice, given a cryptographic hash, if you can find two different blocks or files that produce the same output, please publicise it widely, as you have broken that hash function.

--
Peter Jeremy
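(Although a single hash computation is serial, a common workaround is a simple two-level tree hash: split the file into fixed chunks, hash the chunks in parallel, then hash the concatenated chunk digests. This is a sketch of my own, not something proposed in the thread; the 8 MiB chunk size is arbitrary, and the result is not interchangeable with a plain sha256 of the file, so both sides of any comparison must use the same scheme. Unlike ZFS block checksums, the fixed chunking here is independent of how the file was written, so two identical files always produce the same result.)

#!/usr/bin/env python3
# Sketch only: two-level tree hash with chunk digests computed in parallel.
import hashlib
import os
import sys
from concurrent.futures import ProcessPoolExecutor

CHUNK = 8 << 20  # 8 MiB chunks, an arbitrary choice

def _chunk_digest(args):
    """Hash one fixed-size chunk of the file, identified by its offset."""
    path, offset = args
    with open(path, 'rb') as f:
        f.seek(offset)
        return hashlib.sha256(f.read(CHUNK)).digest()

def tree_sha256(path):
    size = os.path.getsize(path)
    offsets = [(path, off) for off in range(0, max(size, 1), CHUNK)]
    with ProcessPoolExecutor() as pool:
        leaves = list(pool.map(_chunk_digest, offsets))
    # Top-level digest over the concatenation of all chunk digests.
    return hashlib.sha256(b''.join(leaves)).hexdigest()

if __name__ == '__main__':
    print(tree_sha256(sys.argv[1]))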
On Fri, 2010-07-09 at 10:23 +1000, Peter Jeremy wrote:
> On 2010-Jul-09 06:46:54 +0800, Edward Ned Harvey <solaris2 at nedharvey.com> wrote:
> > md5 is significantly slower (but surprisingly not much slower) and it's a
> > cryptographic hash. Probably not necessary for your needs.
>
> As someone else has pointed out, MD5 is no longer considered secure
> (neither is SHA-1). If you want cryptographic hashing, you should
> probably use SHA-256 for now and be prepared to migrate to SHA-3 once
> it is announced. Unfortunately, SHA-256 is significantly slower than
> MD5 (about 4 times on a P-4, about 3 times on a SPARC-IV), and no
> cryptographic hash is amenable to multi-threading. The new crypto
> instructions on some of Intel's recent offerings may help performance
> (and it's likely that they will help more with SHA-3).
>
> > And one more thing: no matter how strong your hash is, unless your hash is
> > just as big as your file, collisions happen. Don't assume data is the same
> > just because the hash is the same, if you care about your data. Always
> > byte-level verify every block or file whose hash matches some other hash.
>
> In theory, collisions happen. In practice, given a cryptographic hash,
> if you can find two different blocks or files that produce the same
> output, please publicise it widely, as you have broken that hash function.

Not necessarily. While you *should* publicize it widely, given all the possible text that we have, and all the other variants, it's theoretically possible to get lucky. Like winning a lottery where everyone else has a million tickets, but you only have one.

Such an occurrence -- if isolated -- would not, IMO, constitute a "breaking" of the hash function.

- Garrett
Nicolas Williams - 2010-Jul-09 16:31 UTC - [zfs-discuss] Hash functions (was Re: Hashing files rapidly on ZFS)
On Thu, Jul 08, 2010 at 08:42:33PM -0700, Garrett D'Amore wrote:
> On Fri, 2010-07-09 at 10:23 +1000, Peter Jeremy wrote:
> > In theory, collisions happen. In practice, given a cryptographic hash,
> > if you can find two different blocks or files that produce the same
> > output, please publicise it widely, as you have broken that hash function.
>
> Not necessarily. While you *should* publicize it widely, given all the
> possible text that we have, and all the other variants, it's theoretically
> possible to get lucky. Like winning a lottery where everyone else has a
> million tickets, but you only have one.
>
> Such an occurrence -- if isolated -- would not, IMO, constitute a
> "breaking" of the hash function.

A hash function is broken when we know how to create colliding inputs. A random collision does not a break make, though it might, perhaps, help figure out how to break the hash function later.

Nico