Eugen Leitl
2011-May-21 18:56 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
----- Forwarded message from Zooko O'Whielacronx <zooko at zooko.com> -----

From: Zooko O'Whielacronx <zooko at zooko.com>
Date: Sat, 21 May 2011 12:50:19 -0600
To: Crypto discussion list <cryptography at randombit.net>
Subject: Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
Reply-To: Crypto discussion list <cryptography at randombit.net>

Dear Nico Williams:

Thanks for the reference! Very cool.

What I would most want is for ZFS (and every other filesystem) to maintain a Merkle Tree over the file data with a good secure hash. Whenever a change to a file is made, the filesystem can update the Merkle Tree with mere O(log(N)) work in the size of the file plus O(N) work in the size of the change. For a modern filesystem like ZFS, which is already maintaining a checksum tree, the *added* cost of maintaining the secure-hash Merkle Tree could be minimal.

Then, the filesystem should make this Merkle Tree available to applications through a simple query.

This would enable applications -- without needing any further in-filesystem code -- to perform a Merkle Tree sync, which would range from "noticeably more efficient" to "dramatically more efficient" than rsync or zfs send. :-)

Of course it is only more efficient because we're treating the maintenance of the secure-hash Merkle Tree as free. There are two senses in which this is legitimate and it is almost free:

1. Since the values get maintained persistently over the file's lifetime, the total computation required is approximately O(N), where N is the total size of all deltas that have been applied to the file in its life. (Let's drop the logarithmic part for now; see 2. below.)

Compare this to the cost of doing a fast, insecure CRC over the whole file, as in rsync. The cost of that is O(N) * K, where N is the (then current) size of the file and K is the number of times you run rsync on that file.

The extreme case is if the file hasn't changed.
Then, for the application-level code to confirm that the file on this machine is the same as the file on that machine, it merely has to ask the filesystem for the root hash on each machine and transmit that root hash over the network. This is optimally fast compared to rsync, and unlike "zfs send|recv" it is optimally fast whenever the two files are identical, even if they have both changed since the last time they were synced.

2. Since a modern, sophisticated filesystem like ZFS is maintaining a tree of checksums over the data *anyway*, you can piggy-back this computation onto that work, avoiding any extra seeks and minimizing extra memory access.

In fact, ZFS itself can use SHA-256 for the checksum tree, which would make it provide almost exactly what I want, except for:

2. a. From what I've read, nobody uses the SHA-256 configuration in ZFS because it is too computationally expensive, so they use an insecure checksum (fletcher2/4) instead.

2. b. I assume the shape of the resulting checksum tree is modified by artifacts of the ZFS layout instead of being a simple canonical shape. This is a show-stopper for this use case, because if the same file data exists on a different system, and some software on that system computes a Merkle Tree over the data, it might come out with different hashes than the ZFS checksum tree, thus eliminating all of the performance benefits of this approach.

But if ZFS could be modified to fix these problems, or if a new filesystem added the feature of maintaining a canonical, reproducible Merkle Tree, then it might be extremely useful.

Thanks to Brian Warner and Dan Shoutis for discussions about this idea.
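[Editor's note: the canonical Merkle tree Zooko describes can be sketched in a few lines. This is an illustrative sketch, not ZFS code; the leaf size, fanout, and the choice of SHA-256 are assumptions that both sides of a sync would have to agree on.]

```python
import hashlib

LEAF_SIZE = 4096  # assumed leaf block size; both machines must agree
FANOUT = 2        # assumed tree breadth; both machines must agree

def merkle_root(data: bytes) -> bytes:
    """Canonical Merkle root over fixed-size leaf blocks. The tree
    shape depends only on len(data), LEAF_SIZE, and FANOUT, so two
    machines holding identical bytes compute identical roots."""
    # Hash each leaf block (an empty file is one empty leaf).
    level = [hashlib.sha256(data[i:i + LEAF_SIZE]).digest()
             for i in range(0, max(len(data), 1), LEAF_SIZE)]
    # Combine FANOUT children at a time until a single root remains.
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + FANOUT])).digest()
                 for i in range(0, len(level), FANOUT)]
    return level[0]

# Syncing starts by exchanging 32-byte roots: if they match, the
# files are identical and no file data crosses the network.
assert merkle_root(b"a" * 10**6) == merkle_root(b"a" * 10**6)
assert merkle_root(b"a" * 10**6) != merkle_root(b"b" * 10**6)
```

Updating after a write only requires rehashing the touched leaf and its ancestors up to the root, which is the O(log(N)) incremental cost cited above.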
Regards,

Zooko
_______________________________________________
cryptography mailing list
cryptography at randombit.net
http://lists.randombit.net/mailman/listinfo/cryptography

----- End forwarded message -----
--
Eugen* Leitl <a href="http://leitl.org">leitl</a> http://leitl.org
______________________________________________________________
ICBM: 48.07100, 11.36820 http://www.ativel.com http://postbiota.org
8B29F6BE: 099D 78BA 2FD3 B014 B08A 7779 75B0 2443 8B29 F6BE
Eugen Leitl
2011-May-22 08:22 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
----- Forwarded message from Nico Williams <nico at cryptonector.com> -----

From: Nico Williams <nico at cryptonector.com>
Date: Sat, 21 May 2011 16:22:27 -0500
To: Crypto discussion list <cryptography at randombit.net>
Subject: Re: [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
Reply-To: Crypto discussion list <cryptography at randombit.net>

On Sat, May 21, 2011 at 1:50 PM, Zooko O'Whielacronx <zooko at zooko.com> wrote:
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.

Me too. ZFS does do that, but unfortunately the internal Merkle tree maintained this way also has metadata tree nodes (namely, indirect blocks), which wouldn't be a problem except that embedded in that metadata (and hashed in) are block pointers, which contain data that is completely non-deterministic relative to the file contents. The right thing to have done would have been to exclude the address portion of the block pointers, but that would have required more data movement and/or computation (and perhaps it was not thought of at the time; I was never on the ZFS team, so I'd not know).

> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.

Right!

> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Indeed. And if non-deterministic metadata is excluded, then the only things one would have to know in order to independently compute the same hash would be the maximum breadth of the tree and the leaf block size, and that the tree is intended to be kept with all but the rightmost leaf node full.
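[Editor's note: Nico's point about excluding non-deterministic metadata can be illustrated with a toy model. The `BlockPointer` class below is a hypothetical, heavily simplified stand-in for a real ZFS block pointer, not its actual layout.]

```python
import hashlib
from dataclasses import dataclass

@dataclass
class BlockPointer:
    """Hypothetical, simplified stand-in for a ZFS block pointer."""
    physical_address: int  # where the child landed on disk: allocation-dependent
    checksum: bytes        # hash of the child's contents: data-dependent only

def indirect_block_hash(children: list[BlockPointer]) -> bytes:
    # Hash only the data-dependent checksum fields, skipping the
    # physical addresses, so the result is reproducible on any system
    # holding the same file contents.
    h = hashlib.sha256()
    for bp in children:
        h.update(bp.checksum)
    return h.digest()

# Same data at different physical locations yields the same hash:
c = hashlib.sha256(b"some block of file data").digest()
h1 = indirect_block_hash([BlockPointer(0x1000, c)])
h2 = indirect_block_hash([BlockPointer(0x9F00, c)])
assert h1 == h2
```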
This would be a wonderful feature to have.

> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

I haven't thought about this enough, but my level-0 concern would be that data moves not aligned with block size would require a rolling hash in order to make synchronization efficient, thus the Merkle tree wouldn't help unless, perhaps, it were constructed from rolling hash functions.

The other thought I had was that now that we accept that file stores need large amounts of computational power, not just I/O bandwidth, it's not a stretch for the server to also compute and store (i.e., pre-compute) arrays of rolling CRCs, each taken over a small data chunk contiguous with the preceding and succeeding ones. This would allow a client to very quickly read the rsync signature of a remote file and locally generate a patch to it, making the write process very efficient for large files that get many small changes all over the place.

> Of course it is only more efficient because we're treating the
> maintenance of the secure-hash Merkle Tree as free. There are two
> senses in which this is legitimate and it is almost free:

And/or because we intend to use this for optimizing an otherwise slow operation, and if we do it enough then it'd be worthwhile even if the given FS had never before used a Merkle tree.

> [...]
> 2. a. From what I've read, nobody uses the SHA-256 configuration in
> ZFS because it is too computationally expensive, so they use an
> insecure checksum (fletcher2/4) instead.

You kinda have to use SHA-256 if you want dedup.

> 2. b.
> I assume the shape of the resulting checksum tree is modified by
> artifacts of the ZFS layout instead of being a simple canonical shape.

Yes: a) the shape of the indirect block tree (but this is deterministic if you know the maximum block size for the filesystem and the maximum block size for the file), and b) physical block address information in indirect blocks (which can and should have been excluded -- it's good enough to compute the hash of an indirect block over the block checksums in each block pointer). See above. Oh, also, there are the first few block pointers in the dnode too.

> This is a show-stopper for this use case because if the same file data
> exists on a different system, and some software on that system
> computes a Merkle Tree over the data, it might come out with different
> hashes than the ZFS checksum tree, thus eliminating all of the
> performance benefits of this approach.

Indeed, but:

> But, if ZFS could be modified to fix these problems or if a new
> filesystem would add a feature of maintaining a canonical,
> reproducible Merkle Tree, then it might be extremely useful.

I think it could be. I don't think you'll get much out of the ZFS folks nowadays, though. But if you talk to ZFS folks outside of Oracle, they might confirm whether this is doable.

BTW, IIRC I filed an RFE at Sun to expose a Merkle hash tree checksum to applications. I believe I've seen others ask for this before too at various times, and it may be that the RFE I filed (if I did) was in response to a request on one of the OSOL discuss lists -- this must have been back in 2005. This is a really, really obvious enhancement to want -- it should be obvious the moment you realize that Merkle hash trees are a godsend.
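[Editor's note: the pre-computed rsync-style signature Nico proposes earlier in this message might look like the following sketch. The chunk size and hash choices are assumptions for illustration, not anything ZFS implements.]

```python
import hashlib
import zlib

CHUNK = 2048  # assumed signature granularity; a real file store would tune this

def precomputed_signature(data: bytes) -> list[tuple[int, bytes]]:
    """Pre-compute an rsync-style signature: one weak rolling checksum
    (adler32) plus one strong hash per contiguous chunk. A file server
    could maintain this as blocks are written, letting a client fetch
    the signature instead of reading the whole file."""
    return [(zlib.adler32(data[i:i + CHUNK]),
             hashlib.sha256(data[i:i + CHUNK]).digest())
            for i in range(0, len(data), CHUNK)]

# A client compares the remote signature against its local copy and
# transfers only the chunks whose (weak, strong) pair differs.
remote = precomputed_signature(b"A" * 4096 + b"B" * 2048)
local = precomputed_signature(b"A" * 4096 + b"C" * 2048)
changed = [i for i, (r, l) in enumerate(zip(remote, local)) if r != l]
assert changed == [2]  # only the third chunk needs to move
```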
Nico
--
_______________________________________________
cryptography mailing list
cryptography at randombit.net
http://lists.randombit.net/mailman/listinfo/cryptography

----- End forwarded message -----
Richard Elling
2011-May-22 15:20 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On May 21, 2011, at 11:56 AM, Eugen Leitl <eugen at leitl.org> wrote:

> ----- Forwarded message from Zooko O'Whielacronx <zooko at zooko.com> -----
>
> What I would most want is for ZFS (and every other filesystem) to
> maintain a Merkle Tree over the file data with a good secure hash.
> Whenever a change to a file is made, the filesystem can update the
> Merkle Tree with mere O(log(N)) work in the size of the file plus
> O(N) work in the size of the change. For a modern filesystem like ZFS
> which is already maintaining a checksum tree the *added* cost of
> maintaining the secure hash Merkle Tree could be minimal.

ZFS already tracks the blocks that have been written, and the time that they were written. So we already know when something was written, though that does not answer the question of whether the data was changed. I think it is a pretty good bet that newly written data is different :-)

> Then, the filesystem should make this Merkle Tree available to
> applications through a simple query.

Something like "zfs diff"?

> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

Since zfs send already has an option to only send the changed blocks, I disagree with your assertion that your solution will be "dramatically more efficient" than zfs send. We already know zfs send is much more efficient than rsync for large file systems.
-- richard
Nico Williams
2011-May-22 18:52 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 10:20 AM, Richard Elling <richard.elling at gmail.com> wrote:
> ZFS already tracks the blocks that have been written, and the time that
> they were written. So we already know when something was written, though
> that does not answer the question of whether the data was changed. I think
> it is a pretty good bet that newly written data is different :-)

Not really. There's bp rewrite (assuming that ever ships, or gets implemented elsewhere), for example.

>> Then, the filesystem should make this Merkle Tree available to
>> applications through a simple query.
>
> Something like "zfs diff"?

That works within a filesystem. And zfs send/recv works when you have one filesystem faithfully tracking another. When you have two filesystems with similar contents, and the history of each is useless in deciding how to do a bi-directional synchronization, then you need a way to diff files that is not based on intra-filesystem history.

The rsync algorithm is the best high-performance algorithm that we have for determining differences between files separated by a network. My proposal (back then, and Zooko's now) is to leverage work that the filesystem does anyway to implement a high-performance remote diff that is faster than rsync, for the simple reason that some of the rsync algorithm essentially gets pre-computed.

>> This would enable applications -- without needing any further
>> in-filesystem code -- to perform a Merkle Tree sync, which would range
>> from "noticeably more efficient" to "dramatically more efficient" than
>> rsync or zfs send. :-)
>
> Since zfs send already has an option to only send the changed blocks,
> I disagree with your assertion that your solution will be "dramatically
> more efficient" than zfs send. We already know zfs send is much more
> efficient than rsync for large file systems.

You missed Zooko's point completely.
It might help to know that Zooko works on a project called the Tahoe Least-Authority Filesystem, which is by nature distributed. Once you lose the constraints of not having a network, or of having uni-directional replication only, I think you'll get it. Or perhaps you'll argue that no one should ever need bi-directional replication -- that if one finds oneself wanting it, then one has taken a wrong turn somewhere.

Nico
--
Nico Williams
2011-May-22 20:24 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On Sun, May 22, 2011 at 1:52 PM, Nico Williams <nico at cryptonector.com> wrote:
> [...] Or perhaps you'll argue that no one should ever need bi-di
> replication, that if one finds oneself wanting that then one has taken
> a wrong turn somewhere.

You could also grant the premise and argue instead that nothing the filesystem can do to speed up remote bi-directional sync is worth the cost -- an argument that requires a lot more analysis. For example, if the filesystem were to compute and store rsync rolling CRC signatures, that would require significant compute and storage resources, and it might not speed up synchronization enough to ever be worthwhile.

Similarly, a Merkle hash tree based on rolling hash functions (and excluding physical block pointer details) might require each hash output to grow linearly with block size in order to retain the rolling hash property (I'm not sure about this; I know very little about rolling hash functions), in which case the added complexity would be a complete non-starter. Whereas a Merkle hash tree built with regular hash functions would not be able to resolve insertions/deletions of data chunks whose size is not a whole multiple of the block size.

Nico
--
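[Editor's note: the unaligned-insertion problem Nico raises is the motivation for content-defined chunking: derive chunk boundaries from the data itself with a rolling hash, so an edit only disturbs boundaries near it. The sketch below uses a toy sliding-sum hash in place of a real Rabin fingerprint; WINDOW and MASK are illustrative values.]

```python
import random

WINDOW = 16   # rolling window size (illustrative)
MASK = 0xFF   # cut when the low 8 bits of the window sum are zero (~256-byte chunks)

def chunks(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries. Because a boundary
    depends only on the WINDOW bytes before it, an insertion disturbs
    boundaries only near the edit; chunks elsewhere keep their contents
    (and thus their hashes), unlike fixed-size blocks where every block
    after the edit shifts."""
    out, start, rolling = [], 0, 0
    for i, b in enumerate(data):
        rolling += b
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward
        if i > WINDOW and (rolling & MASK) == 0:
            out.append(data[start:i + 1])  # cut after byte i
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out

random.seed(1)
base = bytes(random.randrange(256) for _ in range(8000))
edited = base[:40] + b"INSERTED" + base[40:]

assert b"".join(chunks(base)) == base  # chunking is lossless
# Most chunks survive the insertion; with fixed-size blocks, every
# block after offset 40 would change.
assert set(chunks(base)) & set(chunks(edited))
```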
Edward Ned Harvey
2011-May-23 03:23 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Eugen Leitl
>
> This would enable applications -- without needing any further
> in-filesystem code -- to perform a Merkle Tree sync, which would range
> from "noticeably more efficient" to "dramatically more efficient" than
> rsync or zfs send. :-)

Don't compare rsync to zfs send. Rsync must work at the filesystem level and examine every file on both the source and destination machines. Incremental zfs send is instantly and automatically aware, at the block level, of which blocks in the pool have changed between the old and new snapshots.

Zfs send is "dramatically more efficient" than rsync or anything else that needs to walk the filesystem tree, because zfs send doesn't walk the tree: it already knows which blocks changed. And I think that's what you're getting at, yeah? Unless you're wishing you could use something as efficient as incremental zfs send that applies to a whole class of filesystems, not just ZFS... or wishing you could do something as efficient as incremental zfs send without the need for a matching older snapshot... or wishing you could do an operation like incremental zfs send that applies only to a subdirectory, not a whole filesystem.
Richard Elling
2011-May-23 04:24 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
On May 22, 2011, at 11:52 AM, Nico Williams <nico at cryptonector.com> wrote:
> On Sun, May 22, 2011 at 10:20 AM, Richard Elling
> <richard.elling at gmail.com> wrote:
>> ZFS already tracks the blocks that have been written, and the time that
>> they were written. So we already know when something was written, though
>> that does not answer the question of whether the data was changed. I think
>> it is a pretty good bet that newly written data is different :-)
>
> Not really. There's bp rewrite (assuming that ever ships, or gets
> implemented elsewhere), for example.

I don't see a connection to BP rewrite. The BP already has the birth txg, so we know when the block was born.

>>> Then, the filesystem should make this Merkle Tree available to
>>> applications through a simple query.
>>
>> Something like "zfs diff"?
>
> That works within a filesystem. And zfs send/recv works when you have
> one filesystem faithfully tracking another.

Agree. What was not clear in the OP was the use case.

> When you have two filesystems with similar contents, and the history
> of each is useless in deciding how to do a bi-directional
> synchronization, then you need a way to diff files that is not based
> on intra-filesystem history.

Agree.

> The rsync algorithm is the best high-performance algorithm that we
> have for determining differences between files separated by a network.
> My proposal (back then, and Zooko's now) is to leverage work that the
> filesystem does anyway to implement a high-performance remote diff
> that is faster than rsync for the simple reason that some of the rsync
> algorithm essentially gets pre-computed.

I'm not convinced this can work at the block level in ZFS, because the block structure is not guaranteed to be the same between two file systems with identical data written by different means. Also, there is the minor issue that the checksum is computed after compression, so it depends on the compression algorithm, which can be different for each block.
Dedup also brings interesting wrinkles.

>>> This would enable applications -- without needing any further
>>> in-filesystem code -- to perform a Merkle Tree sync, which would range
>>> from "noticeably more efficient" to "dramatically more efficient" than
>>> rsync or zfs send. :-)
>>
>> Since zfs send already has an option to only send the changed blocks,
>> I disagree with your assertion that your solution will be "dramatically
>> more efficient" than zfs send. We already know zfs send is much more
>> efficient than rsync for large file systems.
>
> You missed Zooko's point completely. It might help to know that Zooko
> works on a project called Tahoe Least-Authority Filesystem, which is
> by nature distributed. Once you lose the constraints of not having a
> network or having uni-directional replication only, I think you'll get
> it. Or perhaps you'll argue that no one should ever need bi-di
> replication, that if one finds oneself wanting that then one has taken
> a wrong turn somewhere.

Bidirectional synchronization is best accomplished in the context of the application, not at the abstraction level of the file system.
-- richard
Edward Ned Harvey
2011-May-23 14:09 UTC
[zfs-discuss] [cryptography] rolling hashes, EDC/ECC vs MAC/MIC, etc.
> From: zfs-discuss-bounces at opensolaris.org [mailto:zfs-discuss-
> bounces at opensolaris.org] On Behalf Of Nico Williams
>
> When you have two filesystems with similar contents, and the history
> of each is useless in deciding how to do a bi-directional
> synchronization, then you need a way to diff files that is not based
> on intra-filesystem history.

I think you'll benefit by looking into the differences between the implementation of ZFS pool integrity checksums and what's done with dedup enabled. Theoretically, what's done with dedup is, I think, precisely what you're looking for.

In both cases (with or without dedup), your zpool is made up of parent blocks that reference a bunch of child blocks, and the parent blocks contain the checksums of each of the children. But when dedup is enabled, all those checksums are additionally stored in an AVL tree, so you can quickly find any block that matches a given checksum. Without dedup, the default hash is fletcher, with the option of using something stronger such as sha256; when dedup is enabled, the only hash available is sha256.

I think what you're asking for is precisely this. If you have dedup enabled, then you've already got a self-balancing tree of all the checksums, in order to find any block from its sha256 sum. So if (for example) you wanted to bi-directionally sync a big data file which had some blocks changed at random on both sides, you would only need to compare the parent block on each side. You could easily see (for example) that block #233 inside the file doesn't match block #233 on the remote side, and vice versa.

The main question is what to do with this knowledge, how to put it to use. Because without intra-filesystem history, all you know is that block #233 is different. How do you know which side has the new copy and which the old? You've got to have a baseline from a previous point in time.
You need to know the local side now, the remote side now, and some baseline in the past when the two were in sync. Otherwise you can't do bi-directional sync at all. And of course there's the question of how to handle conflicts: suppose block #233 was changed on both sides.
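[Editor's note: the baseline logic described above reduces to a small per-block decision procedure. This is a hypothetical sketch, not any existing tool; the inputs are the block's checksums from a common baseline snapshot and from each current side.]

```python
def reconcile_block(base_hash: bytes, local_hash: bytes, remote_hash: bytes) -> str:
    """Classify one block given its checksum in a shared baseline and
    on the two current sides. Only hashes are compared; no block data
    needs to cross the network to make this decision."""
    if local_hash == remote_hash:
        return "in-sync"       # identical now, regardless of history
    if local_hash == base_hash:
        return "pull-remote"   # only the remote side changed the block
    if remote_hash == base_hash:
        return "push-local"    # only the local side changed it
    return "conflict"          # both sides changed it: needs a policy

# e.g. the checksums of block #233 on each side:
old, new_a, new_b = b"h-old", b"h-a", b"h-b"
assert reconcile_block(old, old, new_a) == "pull-remote"
assert reconcile_block(old, new_a, old) == "push-local"
assert reconcile_block(old, new_a, new_b) == "conflict"
```

The "conflict" branch is exactly the case the paragraph above ends on: without application-level knowledge, the filesystem alone cannot say which change should win.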