Is anyone else concerned about the fact that rsync doesn't guarantee to produce identical file copies on the target machine?

Don't get me wrong in sounding critical, because I think that rsync is a great example of how software should be written. (I often make the observation, as I learn more about Linux and inevitably find myself comparing open source applications to Microsoft products, that the people who wrote Unix way back when at AT&T Bell Labs REALLY knew what they were doing. I also have the same attitude toward the developer and maintainer of rsync.) But the "Technical Report" at http://rsync.samba.org/tech_report/tech_report.html states that:

"If the two strong checksums match, we assume that we have found a block of A which matches a block of B. In fact the blocks could be different, but the probability of this is microscopic, and in practice this is a reasonable assumption."

Is that good enough? The statement, I believe, refers to some analytical estimate of the chance that the checksums might match despite having different source files for comparison, but has anyone done empirical work to verify that we can pretty much count on getting reliable file copies on the target?

And how does this small probability of file corruption compare to, say, using a full file transfer or copy? In the latter case you might be tempted to think there is zero probability of file corruption, but if you think of any data transfer as sending a digital signal through a noisy communication channel, there must be some way to quantify the reliability of cp versus rsync. I'm not sure that I have all the skills to do this analysis, but I'd be interested in seeing it done.

Regards,
Berend Tober
You gave me a scare with that subject line :-/

I'm glad you like it. Unix as a literature or culture is amazing.

The analysis is done to a reasonable extent in tridge's thesis. MD4 (and MD5) is no longer considered cryptographically strong, but we're not contending against an intelligent adversary here, only random chance.

You might like to look at Schneier's /Applied Cryptography 2nd ed/ for details on MD4. It produces a 128-bit hash; I am fairly sure that the way it's used in rsync means there is a 2^-128 chance of an undetected failure.

Sure, it's only probabilistic. Most aspects of computer systems are:

 - your memory chip or processor might be hit by a sufficiently powerful photon to cause corruption
 - the ECC in your memory might not detect the error (this is based on checksums too, and weaker ones than MD4)
 - your TCP stack might not detect a data channel error (another checksum)
 - all the disks in your RAID set might die simultaneously
 - a comet might strike Earth, extinguishing all life

Schneier has a neat table of various probabilities in chapter 1. A failure of MD4 by random data corruption (2^-128) is astronomically less likely than "winning the top prize in a US state lottery and being killed by lightning on the same day" (2^-55). Etcetera.

Leaving aside random failures, disks will certainly grind themselves into dust before getting anywhere near 2^128 operations. (The universe is about 2^61 seconds old.)

It's possible that something about the way rsync uses MD4 makes the protection much less strong. I suspect one would be more likely to find such a problem by analysis than by testing. So I don't want to discourage you from checking that the probability is actually as low as is claimed, or from finding an embarrassing error in my maths :-), but I don't think you need worry about it merely because it is probabilistic.

rsync's problems mostly lie in software engineering (bugs, portability, back-compatibility, documentation, ...), not computer science (probability, algorithms, etc.). Sometimes I think the other way around would be more fun.

--
Martin
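To put rough numbers on Martin's comparison above, here is a small back-of-the-envelope sketch (not anything from rsync itself; the billion-checks-per-second rate is a made-up figure purely for illustration):

    from math import log2

    # Back-of-the-envelope only: plug in the two probabilities quoted above.
    p_md4_miss = 2.0 ** -128          # random corruption slipping past a 128-bit digest
    p_lottery_lightning = 2.0 ** -55  # Schneier's lottery-plus-lightning figure

    # How many times less likely is an undetected MD4 failure?
    print(f"2^{log2(p_lottery_lightning / p_md4_miss):.0f}")   # -> 2^73

    # And the "grind to dust" point: 2^128 block checks at an assumed rate of
    # a billion per second, measured in universe lifetimes (~2^61 seconds).
    rate = 10 ** 9
    universe_seconds = 2 ** 61
    print(2 ** 128 / rate / universe_seconds)                  # ~1.5e11 universe ages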
On 18 Apr 2002 at 0:12, Martin Pool wrote:

> The analysis is done to a reasonable extent in tridge's thesis.
> MD4 (and MD5) is no longer considered cryptographically strong,
> but we're not contending against an intelligent adversary here,
> only random chance.
>
> You might like to look at Schneier's /Applied Cryptography 2nd ed/ for
> details on MD4. It produces a 128-bit hash; I am fairly sure that the
> way it's used in rsync means there is a 2^-128 chance of an
> undetected failure.
>
> Sure, it's only probabilistic. Most aspects of computer systems
> are:

That was my point about comparing rsync to sending the entire file using, say, ftp or cp. One might think that sending the entire file via ftp or cp will produce an exact file copy; however, the actual transmission of the data takes the form of electrical signals on a wire that must be detected at the receiving end. The detection process must have some false-alarm/missed-detection characteristic, and so there must be some estimate of the probability of ftp and cp failing to produce a reliable copy. So while the software algorithms of ftp and cp are deterministic, there must be some quantifiable probability of failure nonetheless. The difference with rsync is that not only are the same effects of data corruption at work as with ftp and cp, but the algorithm itself introduces non-determinism.

I still think rsync is an incredible tool, despite my expressing this reservation.

Regards,
Berend Tober
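To make that point concrete, here is a small illustrative calculation with entirely assumed numbers (the 100 MB file size and the residual undetected bit error rate are invented for the example, not measured anywhere in this thread):

    # Entirely made-up numbers, just to show the shape of the estimate:
    # even a "plain" copy has a nonzero chance of undetected corruption
    # once the channel is modelled, and rsync adds only ~2^-128 on top.
    file_bytes = 100 * 2 ** 20        # assume a 100 MB file
    p_bit_undetected = 1e-15          # assumed residual undetected bit error rate
    bits = 8 * file_bytes

    p_plain_copy_bad = 1 - (1 - p_bit_undetected) ** bits
    p_rsync_algorithm = 2.0 ** -128   # extra risk from the block matching itself

    print(f"plain cp/ftp: ~{p_plain_copy_bad:.1e}")        # ~8.4e-07 with these assumptions
    print(f"rsync's extra term: ~{p_rsync_algorithm:.1e}")  # ~2.9e-39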
At Wednesday, 17 April 2002 14:52, Berend Tober wrote:

> Is anyone else concerned about the fact that rsync doesn't guarantee
> to produce identical file copies on the target machine?

Computers are not deterministic - only in theory are they. A modern computer system with its gigabyte of memory and all the peripheral chips etc. is expected to have a one-bit failure about once a day. Where else do spurious interrupts come from? And disk errors? And missing spots in TFT monitors? And why would we need memory tests otherwise?

I think the probability of transmitting the checksums wrongly over TCP/IP is magnitudes of magnitudes higher than the probability that the checksum system doesn't detect a difference.

Or what do you think?

- - just for the fun of it - -

--
Michael Zimmermann (Vegaa Safety and Security for Internet Services)
<zim@vegaa.de> phone +49 89 6283 7632 hotline +49 163 823 1195
Key fingerprint = 1E47 7B99 A9D3 698D 7E35 9BB5 EF6B EEDB 696D 5811
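A quick illustrative comparison of the two checksum widths Michael is contrasting; treating an undetected corruption as a random value happening to match a 16-bit or 128-bit checksum is a simplification, not a measurement:

    # Simplified model: the chance that a damaged segment/block happens to
    # keep the same checksum value by accident.
    p_tcp_miss = 2.0 ** -16    # TCP's Internet checksum is 16 bits (~1.5e-5)
    p_md4_miss = 2.0 ** -128   # rsync's strong block checksum is 128 bits

    print(p_tcp_miss / p_md4_miss)   # ~5.2e33, i.e. "magnitudes of magnitudes"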
Berend Tober [btober@computer.org] writes:

> That was my point about comparing rsync to sending the entire file
> using, say, ftp or cp. That is, one might think that sending the
> entire file via ftp or cp will produce an exact file copy, however the
> actual transmission of the data takes the form of electrical signals
> on a wire that must be detected at the receiving end. The detection
> process must have some false alarm/missed detection
> characteristic and so there must be some estimate of the probability
> of ftp and cp failing to produce a reliable copy. So while the
> software algorithms of ftp and cp are deterministic, there must be
> some quantifiable probability of failure nonetheless. The difference
> with rsync is that not only are the same effects of data corruption
> at work as with ftp and cp, but the algorithm itself introduces non-
> determinism.

Except of course that rsync uses its own final checksum to balance out its risk of incorrectly deciding a block is the same. If the final full-file checksum doesn't match, then rsync automatically restarts the transfer (using a slightly different seed, I believe). Thus, it's fairly accurate to compare rsync to performing an ftp or cp and then doing a full checksum on the file, so one could argue it's actually more reliable than a straight ftp/cp without the checksum.

-- David

David Bolen / FitLinxx, Inc. / 860 Canal Street, Stamford, CT 06902
E-mail: db3l@fitlinxx.com / Phone: (203) 708-5192 / Fax: (203) 316-5150
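To illustrate the shape of what David describes, a minimal sketch under stated assumptions: this is not rsync's source, delta_transfer() and reliable_copy() are hypothetical names, and Python's MD5 stands in for rsync's seeded MD4 (hashlib does not guarantee MD4 on every platform):

    import hashlib

    # Sketch only, not rsync's code: transfer, verify with a whole-file digest,
    # and retry with a different seed if the verification fails.
    def whole_file_digest(path, seed):
        h = hashlib.md5(seed.to_bytes(4, "big"))   # mix the transfer seed into the digest
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.digest()

    def reliable_copy(src, dst, delta_transfer, max_attempts=2):
        """Run the block transfer, then verify with a whole-file checksum;
        on mismatch, retry with a different seed so all the per-block
        strong checksums come out differently the second time."""
        for attempt in range(max_attempts):
            seed = attempt + 1                      # a different seed each pass
            delta_transfer(src, dst, seed)          # stand-in for the block algorithm
            if whole_file_digest(src, seed) == whole_file_digest(dst, seed):
                return True
        return False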
Not in the least. The only checksum that guarantees that two files are identical is one from which the entire file can be regenerated in only a single way, in other words, some form of compression. If you want to send the whole file, that's fairly straightforward. Rsync is a way of optimizing the process within certain limits. If a one-in-whatever-it-is chance with those sums is not good enough, don't use rsync.

Tim Conway
tim.conway@philips.com
303.682.4917
Philips Semiconductor - Longmont TC
1880 Industrial Circle, Suite D
Longmont, CO 80501
Available via SameTime Connect within Philips, n9hmg on AIM
perl -e 'print pack(nnnnnnnnnnnn, 19061,29556,8289,28271,29800,25970,8304,25970,27680,26721,25451,25970), ".\n" '
"There are some who call me.... Tim?"
Martin Pool [mbp@samba.org] writes:

> To put it in simple language, the probability of a file transmission
> error being undetected by the MD4 message digest is believed to be
> approximately one in one thousand million million million million
> million million.

I think that's one duodecillion :-)

As a cryptographic message-digest hash, MD4 (and MD5) is intended to require 2^128 operations to crack a specific digest (find the original source), but probably only on the order of 2^64 operations to find two messages that have the same digest. But even that isn't a direct translation to the probability that two random input strings might hash to the same value.

There's an interesting thread from sci.crypt from late last year that addressed some of this question:

  http://groups.google.com/groups?threadm=u21i5llf2bpt03%40corp.supernews.com

In one of the examples where the computation was followed through (the odds of a collision when keeping all 128 bits of the hash and running it against about 67 million files), the probability of a collision came out to about 2^-77. So I suppose you'd sort of have to figure out what you wanted to declare your universe of files to be, since more files would increase the odds and fewer files decrease them. It's about at this point that I sit back and just say, that's one tiny probability!

It is interesting that MD4 has been a "cracked" algorithm for a while now, so if someone was explicitly trying to forge a file that would fool it, it's very doable. But I doubt that changes the odds of two random files colliding. MD5 has not yet had any duplication found (and plenty of protocols currently assume there aren't any), but it's far more computationally intensive to compute, so I think MD4 is more than sufficient for rsync.

-- David

David Bolen / FitLinxx, Inc. / 860 Canal Street, Stamford, CT 06902
E-mail: db3l@fitlinxx.com / Phone: (203) 708-5192 / Fax: (203) 316-5150
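For anyone who wants to reproduce that 2^-77 figure, the standard birthday-bound approximation gets you there; this is illustrative arithmetic only, not the sci.crypt thread's exact method and not anything specific to rsync:

    from math import log2

    # Birthday-bound approximation for n random b-bit digests; valid while
    # the resulting probability is small.
    def collision_probability(n, bits=128):
        return n * (n - 1) / (2 * 2 ** bits)

    p = collision_probability(67_000_000)   # "about 67 million files"
    print(f"~2^{log2(p):.0f}")               # -> ~2^-77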