Hi everyone!

I played around with the rsync sources a little and wrote a small patch that computes checksums from only parts of each file. I'm writing to ask whether the rsync developers would have any interest in the sort of functionality described below. If you do, I'm willing to work with you to produce a cleaned-up patch for git.

For background: this started as a way to scratch an itch I had with backing up my media files. I had some problems, presumably with timestamps (causing some files to be backed up again even though they had not changed), and so I learned about 'rsync -c' to copy only files whose contents had actually changed. However, checksumming big files (even dozens of gigabytes) takes time.

Now, I observed that my files never change just a little or in just a few spots: when a file changes, it changes substantially. Undetected corruption is not an issue here either, as I can survive that by other means. Yet --size-only would have been too coarse: I wanted to peek into the contents of the file a little. So, basically, I needed a quick way to fingerprint a big blob of data with high probability and check whether it had already been backed up.

I experimented by adding a new option that causes file_checksum() to sweep the file not linearly but at increasing intervals. As a first approach, I just doubled the index 'i' in each iteration and added another md5_update applied at location size-i-CSUM_CHUNK. Thus the file is checksummed sparsely, but with increasing density towards the beginning and the end of the file (a rough sketch of the loop is in the P.S. below). This works well enough for me and, best of all, it's blazing fast, with enough practical confidence for my purposes.

Further details and refinements to the implementation and the approach are likely to emerge. But at this point the question is: would the rsync team be interested in a fuzzy-checksum feature like this at all? I'll keep a local fork either way, but that only benefits me.

Simo

--
()  Today is the car of the cdr of your life.
/\  http://arc.pasp.de/
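P.S. To make the idea concrete, here is a minimal, self-contained sketch of the sparse sweep. It is not my actual patch -- rsync's file_checksum() uses its own mapped-I/O and MD5 machinery -- so this illustration uses POSIX pread() and OpenSSL's MD5 instead, and the function name is invented:

/* Hypothetical sketch of the "sparse checksum" idea described above.
 * NOT the actual patch: rsync's file_checksum() reads via map_file()
 * and uses its internal MD5 routines; this standalone version uses
 * pread() and OpenSSL for illustration. */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <openssl/md5.h>

#define CSUM_CHUNK 64  /* chunk size, matching the constant referenced above */

static void hash_chunk_at(int fd, MD5_CTX *ctx, off_t off)
{
    unsigned char buf[CSUM_CHUNK];
    ssize_t n = pread(fd, buf, sizeof buf, off);
    if (n > 0)
        MD5_Update(ctx, buf, (size_t)n);
}

/* Checksum a file sparsely: the chunk offset doubles on every
 * iteration, and each forward chunk is mirrored by one the same
 * distance from the end of the file, so sampling density is highest
 * at the head and the tail.  A file of any size needs only
 * O(log(size)) chunk reads. */
int sparse_file_checksum(const char *path, off_t size,
                         unsigned char digest[MD5_DIGEST_LENGTH])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    MD5_CTX ctx;
    MD5_Init(&ctx);

    hash_chunk_at(fd, &ctx, 0);  /* always cover the very first chunk */
    for (off_t i = CSUM_CHUNK; i + CSUM_CHUNK <= size; i *= 2) {
        hash_chunk_at(fd, &ctx, i);                     /* from the front */
        hash_chunk_at(fd, &ctx, size - i - CSUM_CHUNK); /* mirrored at the back */
    }

    MD5_Final(digest, &ctx);
    close(fd);
    return 0;
}

For a 100 GB file this touches only around 60 chunks (a few kilobytes of reads), which is why it is so fast; the trade-off is that a small change in an unsampled region goes undetected, which is acceptable for my use case as described above.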
On Wed, Dec 28, 2011 at 4:04 AM, Simo Melenius <simo.melenius at iki.fi> wrote:

> However, checksumming big files (even dozens of gigabytes) takes time.
> Now, I observed that my files never change just a little or in just a
> few spots: when a file changes, it changes substantially. Undetected
> corruption is not an issue here either, as I can survive that by other
> means.

Check out the various checksum* and db patches in the patches distribution. They provide a way to cache the checksum for files that haven't changed. They work on the idiom that a file's ctime will change even if the mtime gets set back to an older value: when the ctime changes, rsync recomputes the checksum; for all other files, the cached checksum suffices (a small sketch of this check appears below).

Ignore the checksum-xattr.diff patch, as that just provides a way for a server/mirror host to cache the checksums for files -- it doesn't provide a safe way to detect when a new checksum is needed.

The checksum-updating.diff patch is a reasonable solution, as long as you don't mind a bunch of .rsyncsums files getting sprinkled about (it requires the checksum-reading.diff patch).

Finally, the db.diff patch stores its checksums in a database. It supports MySQL and SQLite3 (though the write speed of the latter needs to be improved). The db patch doesn't yet handle expiring orphaned checksum entries from the db, though.

As for the sparse checksumming, feel free to send me a patch -- I'll consider putting it into the patches release.

..wayne..
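A minimal sketch of the ctime idiom described above, assuming an invented cache-entry structure (the field names here are for illustration only and are not taken from the actual checksum-updating patch):

/* Hypothetical illustration of the caching idiom: a cached checksum
 * is trusted only while size, mtime AND ctime all match what was
 * recorded, because a tool can set mtime back to an older value but
 * cannot forge ctime.  Any mismatch forces a recompute. */
#include <stdbool.h>
#include <sys/stat.h>

struct csum_cache_entry {
    off_t  size;
    time_t mtime;
    time_t ctime;
    unsigned char md5[16];
};

/* Returns true if the cached checksum may be reused for 'path'. */
bool cached_checksum_valid(const char *path,
                           const struct csum_cache_entry *e)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false;
    return st.st_size  == e->size
        && st.st_mtime == e->mtime
        && st.st_ctime == e->ctime;  /* ctime catches "touched-back" files */
}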