Hi,

The following lines in compat.c are rather imprudent:

    if (read_batch || write_batch)
        checksum_seed = 32761;
    else
        checksum_seed = time(NULL);
    write_int(f_out, checksum_seed);

Setting checksum_seed to a constant in batch mode means block collisions are reproducible and predictable. Thus, some files will be permanently "unlucky" in batch mode and will always need retransmission. It also means a malicious adversary can mount a performance-degradation attack by injecting a pair of blocks with the same checksum1 and checksum2 (so that, for example, if you use rsync to back up your website database, someone could slow down your backups by posting carefully crafted comments). The latter issue also exists in non-batch mode, since time() is often predictable.

The right thing to do is to always use a truly random value (e.g., read from /dev/urandom if available). In batch mode, you can store the random value somewhere in the batch fileset.

Note that, above, block hash collisions are very easy to find if you know checksum_seed. The rolling Fletcher checksum1 is trivially defeated. To defeat the k-bit truncated MD4 checksum2, just keep generating random blocks having the same checksum1 until you find two with the same checksum2; by the birthday paradox this will take about 2^(k/2) attempts, where usually k=16 or k=24 with J.W. Schultz's code.

Another note: Donovan Baarda's formula for the probability of retransmission (and thus J.W. Schultz's code) assumes that the hashes are random. This is a reasonable assumption for the truncated MD4 checksum2 when checksum_seed is random, but it's a pretty rotten assumption for the Fletcher checksum1. For the purpose of evaluating the probability of retransmission, rsync uses s2length bytes of good hash plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really apply.

Eran
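[Editor's note: a minimal sketch of what seeding from /dev/urandom might look like, in the spirit of the suggestion above. The function name get_random_seed and the structure are illustrative, not rsync's actual API; the time() fallback mirrors the existing non-batch behaviour for systems without /dev/urandom.]

    #include <stdio.h>
    #include <time.h>

    /* Illustrative replacement for the constant/time() seeding in compat.c:
     * prefer /dev/urandom, and fall back to time() only when it is
     * unavailable (the weak status quo). */
    static int get_random_seed(void)
    {
        int seed;
        FILE *f = fopen("/dev/urandom", "rb");
        if (f != NULL) {
            size_t got = fread(&seed, sizeof seed, 1, f);
            fclose(f);
            if (got == 1)
                return seed;
        }
        return (int)time(NULL); /* weak fallback, predictable */
    }

In batch mode the value returned here would then be written into the batch fileset so that reading the batch back uses the same seed.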
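[Editor's note: to make the "trivially defeated" claim above concrete, here is a sketch of a checksum1 collision. The function below is modeled on rsync's get_checksum1 in checksum.c, simplified (CHAR_OFFSET omitted); since s1 is a plain byte sum and s2 a position-weighted sum, swapping a symmetric pair of bytes preserves both.]

    #include <stdint.h>
    #include <stdio.h>

    /* Simplified model of rsync's rolling Fletcher-style checksum1:
     * s1 = sum of bytes, s2 = sum of (len - i) * buf[i]. */
    static uint32_t checksum1(const unsigned char *buf, int len)
    {
        uint32_t s1 = 0, s2 = 0;
        int i;
        for (i = 0; i < len; i++) {
            s1 += buf[i];
            s2 += (uint32_t)(len - i) * buf[i];
        }
        return (s1 & 0xffff) + (s2 << 16);
    }

    int main(void)
    {
        /* "xyyx" and "yxxy" differ but have equal s1 (same bytes) and
         * equal s2 (weights 4+1 and 3+2 pair up symmetrically). */
        const unsigned char a[4] = { 'x', 'y', 'y', 'x' };
        const unsigned char b[4] = { 'y', 'x', 'x', 'y' };
        printf("checksum1(a)=%08x checksum1(b)=%08x\n",
               checksum1(a, 4), checksum1(b, 4));
        return 0;
    }

Any number of such pairs can be embedded in attacker-supplied data, after which only the k truncated bits of checksum2 stand between the attacker and a forced retransmission.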
On Wed, 2004-03-10 at 16:43, Eran Tromer wrote:
[...]
> Note that, above, block hash collisions are very easy to find if you
> know checksum_seed. The rolling Fletcher checksum1 is trivially
> defeated. To defeat the k-bit truncated MD4 checksum2, just keep
> generate random blocks having the same checksum1 until you find two with
> the same checksum2; by the birthday paradox it will take about 2^(k/2)
> attempts, where usually k=16 or k=24 with J.W. Schultz's code.

I haven't really thought much about carefully crafted data to bust things. I didn't think the Fletcher checksum was that bad. Although I can see that it would be fairly trivial to craft data for collisions, wouldn't it be harder to craft data for a Fletcher collision _and_ attempt to produce a checksum2 collision? The 2^(k/2) attempts assumes random attempts... how does this compare to "crafted to bust Fletcher" attempts?

> Another note is that Donovan Baarda's formula for the probability of
> retransmission (and thus J.W. Schultz's code) assumes that the hashes
> are random. This is a reasonable assumption for the truncated MD4
> checksum2 when checksum_seed is random, but it's a pretty rotten

Does checksum_seed really need to be random for checksum2 to be random? I know MD4 is considered vulnerable compared to MD5, but does it have a poor distribution for a fixed seed? If it does, would it make sense to switch to MD5 rather than randomise the seed?

> assumption for the Fletcher checksum1. For the purpose of evaluating the
> probability of retransmission, rsync uses s2length bytes of good hash
> plus 4 bytes of wishful thinking, and Baarda's analysis doesn't really
> apply.

You can still use the same formula, just don't count checksum1 in the "checksum size" part of it. Or, depending on how much you think it's worth, you could count it as x < 32 bits worth of checksum, not the full 32 bits.

-- 
Donovan Baarda <abo@minkirri.apana.org.au>
http://minkirri.apana.org.au/~abo/
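[Editor's note: Baarda's exact formula is not reproduced in this thread, but the standard birthday-style approximation illustrates the point about not counting checksum1: with n blocks and a b-bit effectively-random hash, the chance of at least one false match is roughly n*(n-1)/2^(b+1). The block count below is purely illustrative.]

    #include <math.h>
    #include <stdio.h>

    /* Birthday-style approximation, NOT Baarda's exact formula:
     * P(some pair of n blocks collides in a b-bit random hash)
     *   ~ n*(n-1) / 2^(b+1), clamped to 1. */
    static double collision_prob(double n, int bits)
    {
        double p = n * (n - 1.0) / pow(2.0, bits + 1);
        return p > 1.0 ? 1.0 : p;
    }

    int main(void)
    {
        double n = 100.0; /* illustrative block count */
        /* s2length = 2 bytes: counting checksum1's 32 bits as real
         * hash vs. ignoring it entirely. */
        printf("16+32 bits counted: %g\n", collision_prob(n, 48));
        printf("16 bits only:       %g\n", collision_prob(n, 16));
        return 0;
    }

Against random data both estimates are defensible, but against adversarially crafted data (where checksum1 collisions are free), only the second one is honest.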