On 2/28/23 12:39, Richard W.M. Jones wrote:
> On Tue, Feb 28, 2023 at 12:24:04PM +0100, Laszlo Ersek wrote:
>> On 2/27/23 17:44, Richard W.M. Jones wrote:
>>> On Mon, Feb 27, 2023 at 08:42:23AM -0600, Eric Blake wrote:
>>>> Or intentionally choose a hash that can be computed out-of-order,
>>>> such as a Merkle Tree. But we'd need a standard setup for all
>>>> parties to agree on how the hash is to be computed and checked, if
>>>> it is going to be anything more than just a linear hash of the
>>>> entire guest-visible contents.
>>>
>>> Unfortunately I suspect that by far the easiest way for people who
>>> host images to compute checksums is to run 'shaXXXsum' on them or
>>> sign them with a GPG signature, rather than engaging in a novel hash
>>> function. Indeed that's what is happening now:
>>>
>>> https://alt.fedoraproject.org/en/verify.html
>>
>> If the output is produced with unordered writes, but the complete
>> output needs to be verified with a hash *chain*, that still allows
>> for some level of asynchrony. The start of the hashing need not be
>> delayed until after the end of output, only after the start of
>> output.
>>
>> For example, nbdcopy could maintain the highest offset up to which
>> the output is contiguous, and on a separate thread, it could be
>> hashing the output up to that offset.
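A minimal sketch of that bookkeeping (hypothetical, not actual nbdcopy
code, and in Python only for brevity) could look like this:

import threading

BLOCK_SIZE = 1 << 20   # hypothetical fixed write granularity (1 MiB)

class ContiguousTracker:
    """Track the offset below which every block has been written."""
    def __init__(self):
        self.lock = threading.Lock()
        self.done = set()   # start offsets of completed, non-contiguous blocks
        self.high = 0       # everything in [0, high) is known to be written

    def mark_written(self, offset):
        # Called by a writer thread once its write at 'offset' completes.
        with self.lock:
            self.done.add(offset)
            while self.high in self.done:
                self.done.remove(self.high)
                self.high += BLOCK_SIZE
            return self.high   # the hasher may read back and hash up to here

tracker = ContiguousTracker()
for off in (2 * BLOCK_SIZE, 0, 1 * BLOCK_SIZE):   # out-of-order completions
    print(tracker.mark_written(off))              # 0, 1048576, 3145728

The hashing thread would then periodically read the newly contiguous
range back and feed it into the running digest.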
>>
>> Considering a gigantic output, as yet unassembled blocks could likely
>> not be buffered in memory (that's why the writes are unordered in the
>> first place!), so the hashing thread would have to re-read the output
>> via NBD. Whether that would cause performance to improve or to
>> deteriorate is undecided IMO. If the far end of the output network
>> block device can accommodate a reader that is independent of the
>> writers, then this level of overlap is beneficial. Otherwise, this
>> extra reader thread would just add more thrashing, and we'd be better
>> off with a separate read-through once writing is complete.
>
> In my mind I'm wondering if there's any mathematical result that lets
> you combine each hash(block_i) into the final hash(block[1..N])
> without needing to compute the hash of each block in order.

I've now checked:
https://en.wikipedia.org/wiki/SHA-2
https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction
https://en.wikipedia.org/wiki/One-way_compression_function#Construction_from_block_ciphers
https://en.wikipedia.org/wiki/One-way_compression_function#Davies%E2%80%93Meyer

Consider the following order of steps (a concrete sketch follows below):

- precompute hash(block[n]), with some initial IV
- throw away block[n]
- wait until block[n-1] is processed, providing the actual IV for
  hashing block[n]
- mix the new IV into hash(block[n]) without having access to block[n]
If such a method existed, it would break the security (i.e., the
original design) of the hash, IMO, as it would separate the IV from
block[n]. In a way, it would make the "mix" and "concat" operators (of
the underlying block cipher's chaining method) distributive. I believe
then you could generate a bunch of *valid* hash(block[n]) values as a
mere function of the IV, without having access to block[n]. You could
perhaps use that for probing against other hash(block[m]) values, and
maybe determine repeating patterns in the plaintext. I'm not a
cryptographer so I can't exactly show what security property is broken
by separating the IV from block[n].
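
To make the chaining constraint concrete, here is a minimal Python
hashlib sketch (the block contents are made up): the state left behind
by blocks [0..n-1] is exactly the IV under which block n gets hashed.

import hashlib

blocks = [b"a" * 64, b"b" * 64, b"c" * 64]   # hypothetical data blocks

h = hashlib.sha256()
for blk in blocks:
    h.update(blk)   # the chained state after this call is the IV for the next block
linear = h.hexdigest()

# The only way to reproduce this digest is to feed the blocks in order;
# hashlib lets you checkpoint the chained state (h.copy()), but it does
# not let you hash block n under a placeholder IV and splice in the
# real IV afterwards, which is exactly the missing "mix" step above.
assert linear == hashlib.sha256(b"".join(blocks)).hexdigest()
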
> (This is what blkhash solves, but unfortunately the output isn't
> compatible with standard hashes.)
Assuming blkhash is a Merkle Tree implementation, it solves a different
problem IMO. In your above notation, hash(block[1..N]) is a
hash of the concatenated plaintext blocks, and that's not what a Merkle
Tree describes. The "mix" and "concat" operators remain
non-distributive; it's the operator trees that differ. With a Merkle
Tree, there are sub-trees that can be evaluated independently of each
other. With SHA256, we have a fully imbalanced operator tree, one where
the tree depth is maximal.
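
For illustration only (this is not blkhash's actual construction, which
I haven't checked), a two-level Merkle-style hash where the leaf hashes
can be computed in any order might look like this:

import hashlib

def leaf_hash(block):
    # A leaf depends only on its own block, so leaves can be computed
    # in any order, on any thread.
    return hashlib.sha256(block).digest()

def root_hash(leaves):
    # Only this final combination step is ordered.
    return hashlib.sha256(b"".join(leaves)).hexdigest()

blocks = [b"x" * 4096, b"y" * 4096, b"z" * 4096]   # hypothetical blocks
leaves = [None] * len(blocks)
for i in (2, 0, 1):                                # out-of-order completion
    leaves[i] = leaf_hash(blocks[i])
print(root_hash(leaves))

The price is the one you note above: the root is not equal to SHA256
over the concatenated plaintext, so it cannot be checked with a plain
'sha256sum' run on the published image.
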
Laszlo