Hi everyone!

I played around with the rsync sources a little and wrote a small patch that computes checksums from only parts of each file. I'm writing to ask whether the rsync developers would have any interest in the sort of functionality described below. If you do, I'm willing to work with you to produce a cleaned-up patch for git.

For background: this started as a way to scratch an itch I had with backing up my media files. I had some problems, presumably with timestamps (causing some files to be backed up again even though they had not changed), and so I learned about 'rsync -c' to copy only files whose contents had actually changed. However, checksumming big files (even dozens of gigabytes) takes time.

Now, I observed that my files never change just a little or in just a few spots: when a file changes, it changes substantially. Undetected corruption is not an issue here either, as I can survive that by other means. Yet --size-only would have been too coarse: I wanted to peek into the contents of the file a little. So, basically, I needed a quick way to fingerprint a big blob of data with high probability and check whether it had already been backed up.

I experimented by adding a new option that causes file_checksum() to sweep the file not linearly but at increasing intervals. As a first approach, I just doubled the index 'i' in each iteration and added another md5_update applied at location size-i-CSUM_CHUNK. Thus the file is checksummed sparsely, but with increasing density towards the beginning and the end of the file (a rough sketch of the loop is in the P.S. below). This works well enough for me and, best of all, it's blazing fast, with enough practical confidence for my purposes.

Further details and refinements to the implementation and the approach are likely to emerge. But at this point the question is: would the rsync team be interested in a fuzzy-checksum feature like this at all? I'll keep a local fork either way, but that only benefits me.

Simo

--
()  Today is the car of the cdr of your life.
/\  http://arc.pasp.de/
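P.S. To make the idea concrete, here is a minimal, self-contained sketch of the sparse sweep. It is not my actual patch -- rsync's file_checksum() uses its own mapped-I/O and MD5 machinery -- so this illustration uses POSIX pread() and OpenSSL's MD5 instead, and the function name is invented:

/* Hypothetical sketch of the "sparse checksum" idea described above.
 * NOT the actual patch: rsync's file_checksum() reads via map_file()
 * and uses its internal MD5 routines; this standalone version uses
 * pread() and OpenSSL for illustration. */
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <openssl/md5.h>

#define CSUM_CHUNK 64  /* chunk size, matching the constant referenced above */

static void hash_chunk_at(int fd, MD5_CTX *ctx, off_t off)
{
    unsigned char buf[CSUM_CHUNK];
    ssize_t n = pread(fd, buf, sizeof buf, off);
    if (n > 0)
        MD5_Update(ctx, buf, (size_t)n);
}

/* Checksum a file sparsely: the chunk offset doubles on every
 * iteration, and each forward chunk is mirrored by one the same
 * distance from the end of the file, so sampling density is highest
 * at the head and the tail.  A file of any size needs only
 * O(log(size)) chunk reads. */
int sparse_file_checksum(const char *path, off_t size,
                         unsigned char digest[MD5_DIGEST_LENGTH])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    MD5_CTX ctx;
    MD5_Init(&ctx);

    hash_chunk_at(fd, &ctx, 0);  /* always cover the very first chunk */
    for (off_t i = CSUM_CHUNK; i + CSUM_CHUNK <= size; i *= 2) {
        hash_chunk_at(fd, &ctx, i);                     /* from the front */
        hash_chunk_at(fd, &ctx, size - i - CSUM_CHUNK); /* mirrored at the back */
    }

    MD5_Final(digest, &ctx);
    close(fd);
    return 0;
}

For a 100 GB file this touches only around 60 chunks (a few kilobytes of reads), which is why it is so fast; the trade-off is that a small change in an unsampled region goes undetected, which is acceptable for my use case as described above.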
On Wed, Dec 28, 2011 at 4:04 AM, Simo Melenius <simo.melenius at iki.fi> wrote:

> However, checksumming big files (even dozens of gigabytes) takes time.
> Now, I observed that my files never change just a little or in just a
> few spots: when a file changes, it changes substantially. Undetected
> corruption is not an issue here either, as I can survive that by other
> means.

Check out the various checksum* and db patches in the patches distribution. They provide a way to cache the checksum for files that haven't changed. They work on the idiom that a file's ctime will change even if the mtime gets set back to an older value: when the ctime changes, rsync recomputes the checksum; for all other files, the cached checksum suffices (a small sketch of this check appears below).

Ignore the checksum-xattr.diff patch, as that just provides a way for a server/mirror host to cache the checksums for files -- it doesn't provide a safe way to detect when a new checksum is needed.

The checksum-updating.diff patch is a reasonable solution, as long as you don't mind a bunch of .rsyncsums files getting sprinkled about (it requires the checksum-reading.diff patch).

Finally, the db.diff patch stores its checksums in a database. It supports MySQL and SQLite3 (though the write speed of the latter needs to be improved). The db patch doesn't yet handle expiring orphaned checksum entries from the db, though.

As for the sparse checksumming, feel free to send me a patch -- I'll consider putting it into the patches release.

..wayne..
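A minimal sketch of the ctime idiom described above, assuming an invented cache-entry structure (the field names here are for illustration only and are not taken from the actual checksum-updating patch):

/* Hypothetical illustration of the caching idiom: a cached checksum
 * is trusted only while size, mtime AND ctime all match what was
 * recorded, because a tool can set mtime back to an older value but
 * cannot forge ctime.  Any mismatch forces a recompute. */
#include <stdbool.h>
#include <sys/stat.h>

struct csum_cache_entry {
    off_t  size;
    time_t mtime;
    time_t ctime;
    unsigned char md5[16];
};

/* Returns true if the cached checksum may be reused for 'path'. */
bool cached_checksum_valid(const char *path,
                           const struct csum_cache_entry *e)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return false;
    return st.st_size  == e->size
        && st.st_mtime == e->mtime
        && st.st_ctime == e->ctime;  /* ctime catches "touched-back" files */
}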