Eugen Leitl
2011-May-21 18:56 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
----- Forwarded message from Zooko O'Whielacronx <zooko at zooko.com> -----

From: Zooko O'Whielacronx <zooko at zooko.com>
Date: Sat, 21 May 2011 12:50:19 -0600
To: Crypto discussion list <cryptography at randombit.net>
Subject: Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
Reply-To: Crypto discussion list <cryptography at randombit.net>

Dear Nico Williams:

Thanks for the reference! Very cool.

What I would most want is for ZFS (and every other filesystem) to maintain a Merkle Tree over the file data with a good secure hash. Whenever a change to a file is made, the filesystem can update the Merkle Tree with mere O(log(N)) work in the size of the file plus O(N) work in the size of the change. For a modern filesystem like ZFS, which is already maintaining a checksum tree, the *added* cost of maintaining the secure-hash Merkle Tree could be minimal.

Then, the filesystem should make this Merkle Tree available to applications through a simple query.

This would enable applications -- without needing any further in-filesystem code -- to perform a Merkle Tree sync, which would range from "noticeably more efficient" to "dramatically more efficient" than rsync or zfs send. :-)

Of course it is only more efficient because we're treating the maintenance of the secure-hash Merkle Tree as free. There are two senses in which this is legitimate and it is almost free:

1. Since the values get maintained persistently over the file's lifetime, the total computation required is approximately O(N), where N is the total size of all deltas that have been applied to the file in its life. (Let's drop the logarithmic part for now; see 2. below.)

Compare this to the cost of doing a fast, insecure CRC over the whole file, as in rsync. The cost of that is O(N) * K, where N is the (then current) size of the file and K is the number of times you run rsync on that file.

The extreme case is if the file hasn't changed.
Then, for the application-level code to confirm that the file on this machine is the same as the file on that machine, it merely has to ask the filesystem for the root hash on each machine and transmit that root hash over the network. This is optimally fast compared to rsync, and unlike "zfs send|recv" it is optimally fast whenever the two files are identical, even if they have both changed since the last time they were synced.

2. Since a modern, sophisticated filesystem like ZFS is maintaining a tree of checksums over the data *anyway*, you can piggy-back this computation onto that work, avoiding any extra seeks and minimizing extra memory access.

In fact, ZFS itself can use SHA-256 for the checksum tree, which would make it provide almost exactly what I want, except for:

2. a. From what I've read, nobody uses the SHA-256 configuration in ZFS because it is too computationally expensive, so they use an insecure checksum (fletcher2/4) instead.

2. b. I assume the shape of the resulting checksum tree is modified by artifacts of the ZFS layout instead of being a simple canonical shape. This is a show-stopper for this use case, because if the same file data exists on a different system, and some software on that system computes a Merkle Tree over the data, it might come out with different hashes than the ZFS checksum tree, thus eliminating all of the performance benefits of this approach.

But if ZFS could be modified to fix these problems, or if a new filesystem added the feature of maintaining a canonical, reproducible Merkle Tree, then it might be extremely useful.

Thanks to Brian Warner and Dan Shoutis for discussions about this idea.
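[Editor's note: the canonical Merkle tree Zooko describes can be sketched in a few lines. This is an illustrative sketch, not ZFS code; the leaf size, fanout, and the choice of SHA-256 are assumptions that both sides of a sync would have to agree on.]

```python
import hashlib

LEAF_SIZE = 4096  # assumed leaf block size; both machines must agree
FANOUT = 2        # assumed tree breadth; both machines must agree

def merkle_root(data: bytes) -> bytes:
    """Canonical Merkle root over fixed-size leaf blocks. The tree
    shape depends only on len(data), LEAF_SIZE, and FANOUT, so two
    machines holding identical bytes compute identical roots."""
    # Hash each leaf block (an empty file is one empty leaf).
    level = [hashlib.sha256(data[i:i + LEAF_SIZE]).digest()
             for i in range(0, max(len(data), 1), LEAF_SIZE)]
    # Combine FANOUT children at a time until a single root remains.
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + FANOUT])).digest()
                 for i in range(0, len(level), FANOUT)]
    return level[0]

# Syncing starts by exchanging 32-byte roots: if they match, the
# files are identical and no file data crosses the network.
assert merkle_root(b"a" * 10**6) == merkle_root(b"a" * 10**6)
assert merkle_root(b"a" * 10**6) != merkle_root(b"b" * 10**6)
```

Updating after a write only requires rehashing the touched leaf and its ancestors up to the root, which is the O(log(N)) incremental cost cited above.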
Regards,

Zooko
_______________________________________________
cryptography mailing list
cryptography at randombit.net
http://lists.randombit.net/mailman/listinfo/cryptography

----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Eugen Leitl
2011-May-22 08:22 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
----- Forwarded message from Nico Williams <nico at cryptonector.com> -----

From: Nico Williams <nico at cryptonector.com>
Date: Sat, 21 May 2011 16:22:27 -0500
To: Crypto discussion list <cryptography at randombit.net>
Subject: Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
Reply-To: Crypto discussion list <cryptography at randombit.net>

On Sat, May 21, 2011 at 1:50 PM, Zooko O'Whielacronx <zooko at zooko.com> wrote:
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.

Me too. ZFS does do that, but unfortunately the internal Merkle tree maintained this way also has metadata tree nodes (namely, indirect blocks), which wouldn't be a problem except that embedded in that metadata (and hashed in) are block pointers, which contain data that is completely non-deterministic relative to the file contents. The right thing to have done would have been to exclude the address portion of the block pointers, but that would have required more data movement and/or computation (and perhaps it was not thought of at the time; I was never on the ZFS team, so I'd not know).

> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.

Right!

> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Indeed. And if non-deterministic metadata is excluded, then the only things one would have to know in order to independently compute the same hash would be the maximum breadth of the tree and the leaf block size, and that the tree is intended to be kept with all but the rightmost leaf node full.
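[Editor's note: Nico's point about excluding non-deterministic metadata can be illustrated with a toy model. The `BlockPointer` class below is a hypothetical, heavily simplified stand-in for a real ZFS block pointer, not its actual layout.]

```python
import hashlib
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Hypothetical, simplified stand-in for a ZFS block pointer."""
    physical_address: int  # where the child landed on disk: allocation-dependent
    checksum: bytes        # hash of the child's contents: data-dependent only

def indirect_block_hash(children: list[BlockPointer]) -> bytes:
    # Hash only the data-dependent checksum fields, skipping the
    # physical addresses, so the result is reproducible on any system
    # holding the same file contents.
    h = hashlib.sha256()
    for bp in children:
        h.update(bp.checksum)
    return h.digest()

# Same data at different physical locations yields the same hash:
c = hashlib.sha256(b"some block of file data").digest()
h1 = indirect_block_hash([BlockPointer(0x1000, c)])
h2 = indirect_block_hash([BlockPointer(0x9F00, c)])
assert h1 == h2
```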
This would be a wonderful feature to have.

> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

I haven't thought about this enough, but my level-0 concern would be that data moves not aligned with block size would require a rolling hash in order to make synchronization efficient, thus the Merkle tree wouldn't help unless, perhaps, it were constructed from rolling hash functions.

The other thought I had was that now that we accept that file stores need large amounts of computational power, not just I/O bandwidth, it's not a stretch for the server to also compute and store (i.e., pre-compute) arrays of rolling CRCs, each taken over a small data chunk contiguous with the preceding and succeeding ones. This would allow a client to very quickly read the rsync signature of a remote file and locally generate a patch to it, making the write process very efficient for large files that get many small changes all over the place.

> Of course it is only more efficient because we're treating the
> maintenance of the secure-hash Merkle Tree as free. There are two
> senses in which this is legitimate and it is almost free:

And/or because we intend to use this for optimizing an otherwise slow operation, and if we do it enough then it'd be worthwhile even if the given FS had never before used a Merkle tree.

> [...]
> 2. a. From what I've read, nobody uses the SHA-256 configuration in
> ZFS because it is too computationally expensive, so they use an
> insecure checksum (fletcher2/4) instead.

You kinda have to use SHA-256 if you want dedup.

> 2. b.
> I assume the shape of the resulting checksum tree is modified by
> artifacts of the ZFS layout instead of being a simple canonical shape.

Yes: a) the shape of the indirect block tree (but this is deterministic if you know the maximum block size for the filesystem and the maximum block size for the file), and b) physical block address information in indirect blocks (which can and should have been excluded -- it's good enough to compute the hash of an indirect block over the block checksums in each block pointer). See above. Oh, also, there are the first few block pointers in the dnode too.

> This is a show-stopper for this use case because if the same file data
> exists on a different system, and some software on that system
> computes a Merkle Tree over the data, it might come out with different
> hashes than the ZFS checksum tree, thus eliminating all of the
> performance benefits of this approach.

Indeed, but:

> But, if ZFS could be modified to fix these problems or if a new
> filesystem would add a feature of maintaining a canonical,
> reproducible Merkle Tree, then it might be extremely useful.

I think it could be. I don't think you'll get much out of the ZFS folks nowadays, though. But if you talk to ZFS folks outside of Oracle, they might confirm whether this is doable.

BTW, IIRC I filed an RFE at Sun to expose a Merkle hash tree checksum to applications. I believe I've seen others ask for this before too at various times, and it may be that the RFE I filed (if I did) was in response to a request on one of the OSOL discuss lists -- this must have been back in 2005. This is a really, really obvious enhancement to want -- it should be obvious the moment you realize that Merkle hash trees are a godsend.
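[Editor's note: the pre-computed rsync-style signature Nico proposes earlier in this message might look like the following sketch. The chunk size and hash choices are assumptions for illustration, not anything ZFS implements.]

```python
import hashlib
import zlib

CHUNK = 2048  # assumed signature granularity; a real file store would tune this

def precomputed_signature(data: bytes) -> list[tuple[int, bytes]]:
    """Pre-compute an rsync-style signature: one weak rolling checksum
    (adler32) plus one strong hash per contiguous chunk. A file server
    could maintain this as blocks are written, letting a client fetch
    the signature instead of reading the whole file."""
    return [(zlib.adler32(data[i:i + CHUNK]),
             hashlib.sha256(data[i:i + CHUNK]).digest())
            for i in range(0, len(data), CHUNK)]

# A client compares the remote signature against its local copy and
# transfers only the chunks whose (weak, strong) pair differs.
remote = precomputed_signature(b"A" * 4096 + b"B" * 2048)
local = precomputed_signature(b"A" * 4096 + b"C" * 2048)
changed = [i for i, (r, l) in enumerate(zip(remote, local)) if r != l]
assert changed == [2]  # only the third chunk needs to move
```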
Nico
--
_______________________________________________
cryptography mailing list
cryptography at randombit.net
http://lists.randombit.net/mailman/listinfo/cryptography

----- End forwarded message -----
Richard Elling
2011-May-22 15:20 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On May 21, 2011, at 11:56 AM, Eugen Leitl <eugen at leitl.org> wrote:

> ----- Forwarded message from Zooko O'Whielacronx <zooko at zooko.com> -----
>
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.
> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.

ZFS already tracks the blocks that have been written, and the time that they were written. So we already know when something was written, though that does not answer the question of whether the data was changed. I think it is a pretty good bet that newly written data is different :-)

> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Something like "zfs diff"?

> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

Since zfs send already has an option to only send the changed blocks, I disagree with your assertion that your solution will be "dramatically more efficient" than zfs send. We already know zfs send is much more efficient than rsync for large file systems.
-- richard
Nico Williams
2011-May-22 18:52 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 10:20 AM, Richard Elling <richard.elling at gmail.com> wrote:
> ZFS already tracks the blocks that have been written, and the time that
> they were written. So we already know when something was written, though
> that does not answer the question of whether the data was changed. I think
> it is a pretty good bet that newly written data is different :-)

Not really. There's bp rewrite (assuming that ever ships, or gets implemented elsewhere), for example.

>> Then, the filesystem should make this Merkle Tree available to
>> applications through a simple query.
>
> Something like "zfs diff"?

That works within a filesystem. And zfs send/recv works when you have one filesystem faithfully tracking another. When you have two filesystems with similar contents, and the history of each is useless in deciding how to do a bi-directional synchronization, then you need a way to diff files that is not based on intra-filesystem history.

The rsync algorithm is the best high-performance algorithm that we have for determining differences between files separated by a network. My proposal (back then, and Zooko's now) is to leverage work that the filesystem does anyway to implement a high-performance remote diff that is faster than rsync, for the simple reason that some of the rsync algorithm essentially gets pre-computed.

>> This would enable applications -- without needing any further
>> in-filesystem code -- to perform a Merkle Tree sync, which would range
>> from "noticeably more efficient" to "dramatically more efficient" than
>> rsync or zfs send. :-)
>
> Since zfs send already has an option to only send the changed blocks,
> I disagree with your assertion that your solution will be "dramatically
> more efficient" than zfs send. We already know zfs send is much more
> efficient than rsync for large file systems.

You missed Zooko's point completely.
It might help to know that Zooko works on a project called the Tahoe Least-Authority Filesystem, which is by nature distributed. Once you lose the constraints of not having a network, or of having uni-directional replication only, I think you'll get it. Or perhaps you'll argue that no one should ever need bi-directional replication -- that if one finds oneself wanting it, then one has taken a wrong turn somewhere.

Nico
--
Nico Williams
2011-May-22 20:24 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 1:52 PM, Nico Williams <nico at cryptonector.com> wrote:
> [...] Or perhaps you'll argue that no one should ever need bi-di
> replication, that if one finds oneself wanting that then one has taken
> a wrong turn somewhere.

You could also grant the premise and argue instead that nothing the filesystem can do to speed up remote bi-directional sync is worth the cost -- an argument that requires a lot more analysis. For example, if the filesystem were to compute and store rsync rolling CRC signatures, that would require significant compute and storage resources, and it might not speed up synchronization enough to ever be worthwhile.

Similarly, a Merkle hash tree based on rolling hash functions (and excluding physical block pointer details) might require each hash output to grow linearly with block size in order to retain the rolling hash property (I'm not sure about this; I know very little about rolling hash functions), in which case the added complexity would be a complete non-starter. Whereas a Merkle hash tree built with regular hash functions would not be able to resolve insertions/deletions of data chunks whose size is not a whole multiple of the block size.

Nico
--
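[Editor's note: the unaligned-insertion problem Nico raises is the motivation for content-defined chunking: derive chunk boundaries from the data itself with a rolling hash, so an edit only disturbs boundaries near it. The sketch below uses a toy sliding-sum hash in place of a real Rabin fingerprint; WINDOW and MASK are illustrative values.]

```python
import random

WINDOW = 16   # rolling window size (illustrative)
MASK = 0xFF   # cut when the low 8 bits of the window sum are zero (~256-byte chunks)

def chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries. Because a boundary
    depends only on the WINDOW bytes before it, an insertion disturbs
    boundaries only near the edit; chunks elsewhere keep their contents
    (and thus their hashes), unlike fixed-size blocks where every block
    after the edit shifts."""
    out, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward
        if i > WINDOW and (rolling & MASK) == 0:
            out.append(data[start:i + 1])  # cut after byte i
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(1)
base = bytes(random.randrange(256) for _ in range(8000))
edited = base[:40] + b"INSERTED" + base[40:]

assert b"".join(chunks(base)) == base  # chunking is lossless
# Most chunks survive the insertion; with fixed-size blocks, every
# block after offset 40 would change.
assert set(chunks(base)) & set(chunks(edited))
```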
Edward Ned Harvey
2011-May-23 03:23 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Eugen Leitl
>
> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

Don't compare rsync to zfs send. Rsync must work at the filesystem level and examine every file on both the source and destination machines. Incremental zfs send is instantly and automatically aware, at the block level, of which blocks in the pool have changed between the old and new snapshots.

Zfs send is "dramatically more efficient" than rsync or anything else that needs to walk the filesystem tree, because zfs send doesn't walk the tree: it already knows which blocks changed. And I think that's what you're getting at, yeah? Unless you're wishing you could use something as efficient as incremental zfs send that applies to a whole class of filesystems, not just ZFS... or wishing you could do something as efficient as incremental zfs send without the need for a matching older snapshot... or wishing you could do an operation like incremental zfs send that applies only to a subdirectory, not a whole filesystem.
Richard Elling
2011-May-23 04:24 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On May 22, 2011, at 11:52 AM, Nico Williams <nico at cryptonector.com> wrote:
> On Sun, May 22, 2011 at 10:20 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> ZFS already tracks the blocks that have been written, and the time that
>> they were written. So we already know when something was written, though
>> that does not answer the question of whether the data was changed. I think
>> it is a pretty good bet that newly written data is different :-)
>
> Not really. There's bp rewrite (assuming that ever ships, or gets
> implemented elsewhere), for example.

I don't see a connection to BP rewrite. The BP already has the birth txg, so we know when the block was born.

>>> Then, the filesystem should make this Merkle Tree available to
>>> applications through a simple query.
>>
>> Something like "zfs diff"?
>
> That works within a filesystem. And zfs send/recv works when you have
> one filesystem faithfully tracking another.

Agree. What was not clear in the OP was the use case.

> When you have two filesystems with similar contents, and the history
> of each is useless in deciding how to do a bi-directional
> synchronization, then you need a way to diff files that is not based
> on intra-filesystem history.

Agree.

> The rsync algorithm is the best high-performance algorithm that we
> have for determining differences between files separated by a network.
> My proposal (back then, and Zooko's now) is to leverage work that the
> filesystem does anyway to implement a high-performance remote diff
> that is faster than rsync for the simple reason that some of the rsync
> algorithm essentially gets pre-computed.

I'm not convinced this can work at the block level in ZFS, because the block structure is not guaranteed to be the same between two file systems with identical data written by different means. Also, there is the minor issue that the checksum is computed after compression, so it depends on the compression algorithm, which can be different for each block.
Dedup also brings interesting wrinkles.

>>> This would enable applications -- without needing any further
>>> in-filesystem code -- to perform a Merkle Tree sync, which would range
>>> from "noticeably more efficient" to "dramatically more efficient" than
>>> rsync or zfs send. :-)
>>
>> Since zfs send already has an option to only send the changed blocks,
>> I disagree with your assertion that your solution will be "dramatically
>> more efficient" than zfs send. We already know zfs send is much more
>> efficient than rsync for large file systems.
>
> You missed Zooko's point completely. It might help to know that Zooko
> works on a project called Tahoe Least-Authority Filesystem, which is
> by nature distributed. Once you lose the constraints of not having a
> network or having uni-directional replication only, I think you'll get
> it. Or perhaps you'll argue that no one should ever need bi-di
> replication, that if one finds oneself wanting that then one has taken
> a wrong turn somewhere.

Bidirectional synchronization is best accomplished in the context of the application, not at the abstraction level of the file system.
-- richard
Edward Ned Harvey
2011-May-23 14:09 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> When you have two filesystems with similar contents, and the history
> of each is useless in deciding how to do a bi-directional
> synchronization, then you need a way to diff files that is not based
> on intra-filesystem history.

I think you'll benefit by looking into the differences between the implementation of ZFS pool integrity checksums and what's done with dedup enabled. Theoretically, what's done with dedup is, I think, precisely what you're looking for.

In both cases (with or without dedup), your zpool is made up of parent blocks that reference a bunch of child blocks, and the parent blocks contain the checksums of each of the children. But when dedup is enabled, all those checksums are additionally stored in an AVL tree, so you can quickly find any block that matches a given checksum. Without dedup, the default hash is fletcher, with the option of using something stronger such as sha256; when dedup is enabled, the only hash available is sha256.

I think what you're asking for is precisely this. If you have dedup enabled, then you've already got a self-balancing tree of all the checksums, in order to find any block from its sha256 sum. So if (for example) you wanted to bi-directionally sync a big data file which had some blocks changed at random on both sides, you would only need to compare the parent block on each side. You could easily see (for example) that block #233 inside the file doesn't match block #233 on the remote side, and vice versa.

The main question is what to do with this knowledge, how to put it to use. Because without intra-filesystem history, all you know is that block #233 is different. How do you know which side has the new copy and which the old? You've got to have a baseline from a previous point in time.
You need to know the local side now, the remote side now, and some baseline in the past when the two were in sync. Otherwise you can't do bi-directional sync at all. And of course there's the question of how to handle conflicts: suppose block #233 was changed on both sides.
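[Editor's note: the baseline logic described above reduces to a small per-block decision procedure. This is a hypothetical sketch, not any existing tool; the inputs are the block's checksums from a common baseline snapshot and from each current side.]

```python
def reconcile_block(base_hash: bytes, local_hash: bytes, remote_hash: bytes) -> str:
    """Classify one block given its checksum in a shared baseline and
    on the two current sides. Only hashes are compared; no block data
    needs to cross the network to make this decision."""
    if local_hash == remote_hash:
        return "in-sync"       # identical now, regardless of history
    if local_hash == base_hash:
        return "pull-remote"   # only the remote side changed the block
    if remote_hash == base_hash:
        return "push-local"    # only the local side changed it
    return "conflict"          # both sides changed it: needs a policy

# e.g. the checksums of block #233 on each side:
old, new_a, new_b = b"h-old", b"h-a", b"h-b"
assert reconcile_block(old, old, new_a) == "pull-remote"
assert reconcile_block(old, new_a, old) == "push-local"
assert reconcile_block(old, new_a, new_b) == "conflict"
```

The "conflict" branch is exactly the case the paragraph above ends on: without application-level knowledge, the filesystem alone cannot say which change should win.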