Doug Robinson
2014-Mar-11 22:11 UTC
Caching {filePath,mtime64,checksum} values to speed up execution-time
Folks:
When using rsync to copy huge amounts of data I've found that a significant
amount of time is spent computing the checksums. Sometimes hours, ...
sometimes days - it depends on the total amount of data checked! And after
that sometimes it's only a few files that need to be updated.
I've pulled the latest git (rsync-3.1.1pre1) and didn't see anything to
address this (or I missed it?).
I was wondering what folks thought of a proposal to enhance rsync to be
able to create and maintain a cache of {filePath, 64-bit mtime, checksum}
beforehand on both source and target systems and then use that cache later
on when asked to sync the two systems together? Then cache entry
validation would be a quick stat64() to make sure that the 64-bit mtime
didn't change before sending the checksum over the wire for comparison.
Clearly the cache would need to be completely invalidated (or re-created)
if the file system became corrupt. That could be handled via an "rm -rf"
of the cache.
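A minimal sketch of the kind of cache being proposed, written in Python for
brevity rather than as rsync code. The function names, the JSON file format
and the use of MD5 are all assumptions made for illustration; nothing here
reflects an existing rsync option or patch.

```python
import hashlib
import json
import os


def file_checksum(path, chunk_size=1 << 20):
    """Hash the file contents in chunks (MD5 is an illustrative choice)."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_checksum(cache, path):
    """Return the checksum for path, re-hashing only if the mtime changed."""
    mtime_ns = os.stat(path).st_mtime_ns  # one cheap stat() per file
    entry = cache.get(path)
    if entry is not None and entry["mtime_ns"] == mtime_ns:
        return entry["checksum"]  # cache hit: the file is not read
    checksum = file_checksum(path)  # missing or stale entry: hash the file
    cache[path] = {"mtime_ns": mtime_ns, "checksum": checksum}
    return checksum


def load_cache(cache_file):
    """Load the cache; a deleted cache file simply means starting empty."""
    try:
        with open(cache_file, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_cache(cache, cache_file):
    """Write the whole cache back to disk as JSON."""
    with open(cache_file, "w", encoding="utf-8") as handle:
        json.dump(cache, handle)
```

With this shape, each host would call cached_checksum() for every file and
only the resulting checksums would be compared across the wire. Removing the
cache file forces every checksum to be recomputed on the next run.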
Thoughts?
Thank you.
Doug
--
WANdisco // *Non-Stop Data*
t. 925-396-1125
e. doug.robinson at wandisco.com
--
Join us in New York and San Francisco for Subversion & Git Live
2014<http://www.wandisco.com/subversion-git-live-2014>
Listed on the London Stock Exchange:
WAND<http://www.bloomberg.com/quote/WAND:LN>
THIS MESSAGE AND ANY ATTACHMENTS ARE CONFIDENTIAL, PROPRIETARY, AND MAY BE
PRIVILEGED. If this message was misdirected, WANdisco, Inc. and its
subsidiaries, ("WANdisco") does not waive any confidentiality or
privilege.
If you are not the intended recipient, please notify us immediately and
destroy the message without disclosing its contents to anyone. Any
distribution, use or copying of this e-mail or the information it contains
by other than an intended recipient is unauthorized. The views and
opinions expressed in this e-mail message are the author's own and may not
reflect the views and opinions of WANdisco, unless the author is authorized
by WANdisco to express such views or opinions on its behalf. All email
sent to or from this address is subject to electronic storage and review by
WANdisco. Although WANdisco operates anti-virus programs, it does not
accept responsibility for any damage whatsoever caused by viruses being
passed.
Doug Robinson
2014-Mar-11 22:16 UTC
Caching {filePath,mtime64,checksum} values to speed up execution-time
Ah, forgot to mention that we can't preserve time-stamps, and matching byte
counts give false negatives (two files of the same size can still differ),
so we really need to compare file checksums.
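As a small illustration of why a size-only comparison is not enough, the
Python snippet below builds two byte strings of equal length whose checksums
differ. The sample data and the choice of MD5 are assumptions made purely
for the example.

```python
import hashlib

# Two hypothetical file contents with identical length but different bytes.
old_content = b"balance=100"
new_content = b"balance=900"

# A size-only comparison would treat these as unchanged.
same_size = len(old_content) == len(new_content)

# A content checksum tells them apart.
same_checksum = (
    hashlib.md5(old_content).digest() == hashlib.md5(new_content).digest()
)

print("same size:", same_size)
print("same checksum:", same_checksum)
```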
Thanks again.
Doug
Kevin Korb
2014-Mar-11 22:18 UTC
Caching {filePath,mtime64,checksum} values to speed up execution-time
--checksum should not be used during normal rsync operations. It is for
special cases only. Rsync can still have a lot of overhead getting the
timestamps via stat(), but that can't really be helped.
I don't really understand how file mtimes would be cached. How would rsync
know which mtimes don't match the cache without stat()ing the files? And
once it has done that, the job is already done, so the cache wouldn't
accomplish anything.
--
Kevin Korb
Systems Administrator, FutureQuest, Inc., Orlando, Florida
Phone: (407) 252-6853
Kevin at FutureQuest.net (work)
kmk at sanitarium.net (personal)
Web page: http://www.sanitarium.net/
PGP public key available on web site.
Wayne Davison
2014-Mar-14 00:37 UTC
Caching {filePath,mtime64,checksum} values to speed up execution-time
On Tue, Mar 11, 2014 at 3:11 PM, Doug Robinson <doug.robinson at wandisco.com> wrote:
> I was wondering what folks thought of a proposal to enhance rsync to be
> able to create and maintain a cache of {filePath, 64-bit mtime, checksum}
> beforehand on both source and target systems and then use that cache later
> on when asked to sync the two systems together?
See patches (in order of recommendation): db.diff, checksum-updating.diff,
checksum-xattr.diff.
I personally use db.diff in one situation at work combined with a sqlite DB
on the source and destination machines. You just need to periodically weed
out any old inode values (via rsyncdb --clean /dirs) if things start to
bloat. In the future I'd like to see the db.diff code included by default
as loadable libraries, which would allow someone to install plain rsync and
only also install sqlite-using rsync and/or mysql-using modules if they
want the extra functionality. There is also a plan to eventually have the
db code map the inodes in the db to paths for things like rename
optimizations.
That said, all these patches currently do is cache checksums. The db
patch's default strict checking only uses a cached inode's info if the
size+mtime+ctime all match what we knew about the file when it was cached
(which makes it pretty safe). If you switch to a more lax algorithm (no
ctime) you need to be extra sure the files don't get updated in some way as
to leave the file matching the laxer inode info (e.g. only let rsync make
changes to the files and/or make sure that modify timestamps always
increase so that there is no chance of accidentally matching an older inode
record).
If you're wondering how an mtime-using algorithm helps your use case, keep
in mind that the mtimes don't need to match between hosts, just between
each host's files and its db cache (and any non-matching or missing ones
get (re)computed to the new checksum).
I'll also point out that if you want to use sqlite, I recommend you use the
very latest db.diff (from the git patches repo) since it has a change that
alleviates locking contention between the multiple rsync processes in a
single copy (you can't really share the db between simultaneous rsync
copies due to sqlite's poor multi-process locking -- use mysql for that).
The rsyncdb manpage has info on initializing the db, noting mounts,
maintenance, etc.
The other patches might also be useful to you, so feel free to check them
out:
https://git.samba.org/?p=rsync-patches.git
..wayne..
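The difference between the strict and laxer checks described in this message
can be sketched roughly as follows. This is not code from db.diff: the
CachedInfo record and the cache_entry_usable() function are hypothetical
names, and Python is used only to keep the illustration short.

```python
import os
from typing import NamedTuple


class CachedInfo(NamedTuple):
    """What was recorded about a file when its checksum was cached."""
    size: int
    mtime_ns: int
    ctime_ns: int
    checksum: str


def cache_entry_usable(entry, st, strict=True):
    """Decide whether a cached checksum may be reused for a file.

    `st` is an os.stat() result (or any object with the same fields).
    Strict mode requires size, mtime and ctime to match; lax mode
    ignores ctime and is therefore easier to fool.
    """
    if entry.size != st.st_size or entry.mtime_ns != st.st_mtime_ns:
        return False
    if strict and entry.ctime_ns != st.st_ctime_ns:
        return False
    return True


def remember(path, checksum):
    """Build a cache record from the file's current stat() data."""
    st = os.stat(path)
    return CachedInfo(st.st_size, st.st_mtime_ns, st.st_ctime_ns, checksum)
```

A file rewritten with same-size content and then given its old mtime back
would still pass the lax check, but on systems where such changes update
ctime the strict check would reject the stale entry and the checksum would
be recomputed.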