Evan Harris
2005-Aug-18 09:46 UTC
Question and feature requests for processor bound systems
Is there any way to disable the checksum block search in rsync, or to somehow optimize it for systems that are processor-bound in addition to being network-bound? I'm using rsync on very low-power embedded systems to sync files that are sometimes comparatively large (a few hundred megabytes or more), and I'm finding that just checksumming one such file on the sender takes tens of minutes. The systems in question have processors on the order of a Pentium 166, and in the tests I did the other day, syncing a single ~500 MB file took between 15 and 20 minutes for the checksum calculation alone. When these systems are potentially battery powered, keeping the system up for long periods at full processor utilization is very expensive in power terms.

I couldn't find any such option, and I was trying to come up with a way to reduce that CPU-bound problem without completely abandoning rsync. So here are some proposed solutions that I'm submitting as feature requests to help avoid this issue.

Option 1: Add an option, maybe --optimize-append, that would optimize the checksum search by telling rsync that it can assume files are probably just appended to, like logfiles. This would make rsync not checksum the files at all except for very rudimentary checking. I would think a good algorithm might be to checksum only the first and last block of an existing file, and if those two blocks are the same, assume all intervening data is also the same and just transfer the remaining data. This is basically a hint that the file is only being appended to. If either of those blocks doesn't match, fall back to the full checksum algorithm.

Option 2: Add an option, maybe --checksum-block-skip=N, that would tell rsync to checksum only every Nth block of the file.
This would still keep most of the advantages of rsync, but would allow CPU-bound systems to speed up the checksumming process at the expense of possibly not detecting file differences that fall between the checksummed blocks. This would basically be a hint that the only changes the file should contain are insertions or deletions of data within the file, but no in-place updates of blocks. It would also help on systems that are disk-bound in addition to being network- and CPU-bound, in that rsync wouldn't have to read every block of the file to send checksums.

Option 3: Add an option, maybe --checksum-block-bytes=N, that would tell rsync to checksum only the first N bytes of every block. This would probably be used with a very large --block-size. This would be a hint that the file should have no insertions or deletions of data, but only in-place updates with large blocks, or possibly appended additions. This would also help disk-bound systems.

Option 4: Add an option, maybe --optimize-cpu or --weak-checksums, that would tell rsync to use only weak checksums up to the point in the file where the weak checksums first differ, and then fall back to the normal weak and strong checksums from there on. This is a hint that the file is most likely appended to, but it will still catch most occurrences where a file was modified.

All of these options might also benefit from another option that says to apply these optimizations only to files over a certain size, or where the automatic block size is over a certain size.

Obviously, these optimizations would all be for systems with comparatively low CPU power, but as average file sizes continue to get larger and larger, they would also benefit much faster systems when used on very large files (several gigabytes and up).

In the process of testing this, I also found that the ten-minute timeout I had set on the receiver side wasn't sufficient.
So I was also wondering whether it would be possible to add an option to make rsync, when used in daemon mode and not over another shell transport, use some form of TCP keepalives during long-running processes. This could allow me to reduce the timeout to a smaller value like 2 minutes, but still not let the rsync connection die as long as the remote system still had a "live" connection, even when one end was waiting on the other for very long operations (like this long-running checksum) and there was no other connection traffic.

Thoughts?

Evan
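
The first-and-last-block check proposed in Option 1 can be illustrated with a minimal Python sketch. It is not rsync code: the names `append_offset`, `block_digest`, and `BLOCK_SIZE` are hypothetical, SHA-256 stands in for whatever strong checksum rsync would use, and both files are read locally, whereas rsync would exchange the two block checksums over the network.

```python
import hashlib
import os

BLOCK_SIZE = 64 * 1024  # hypothetical block size, not rsync's actual choice


def block_digest(path, offset, length):
    """Return the SHA-256 digest of `length` bytes of `path` starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hashlib.sha256(f.read(length)).digest()


def append_offset(old_path, new_path, block_size=BLOCK_SIZE):
    """Return the offset from which new data should be sent, or None.

    None means the append-only hint does not hold and the caller should
    fall back to the full checksum algorithm.
    """
    old_size = os.path.getsize(old_path)
    new_size = os.path.getsize(new_path)
    if old_size == 0 or new_size < old_size:
        return None

    # Probe only the first and last block of the existing (old) file.
    last_start = max(0, old_size - block_size)
    probes = {
        (0, min(block_size, old_size)),
        (last_start, old_size - last_start),
    }
    for offset, length in probes:
        if block_digest(old_path, offset, length) != block_digest(new_path, offset, length):
            return None

    # Both probes match: assume everything in between is unchanged too.
    return old_size
```

As the proposal itself notes, this is only a hint: a change that falls strictly between the two probed blocks goes undetected, which is the price of reading two blocks instead of the whole file.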
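
For the keepalive request, the underlying mechanism is an ordinary socket option. The following Python sketch shows how a program might turn it on for a TCP socket; it says nothing about how rsync itself would expose this. The helper name `enable_keepalive` and the timing values are assumptions, and the `TCP_KEEP*` tuning options are platform-specific, so they are applied only where the platform defines them.

```python
import socket


def enable_keepalive(sock, idle=60, interval=20, count=3):
    """Enable TCP keepalive probes on `sock` (hypothetical helper).

    idle:     seconds of silence before the first probe
    interval: seconds between probes
    count:    unanswered probes before the connection is dropped
    """
    # SO_KEEPALIVE itself is widely available.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # The tuning knobs are not defined everywhere, so look them up by name
    # and skip any that are missing or rejected by the OS.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        opt = getattr(socket, name, None)
        if opt is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
        except OSError:
            pass
    return sock
```

With probes like these, an idle but healthy connection keeps exchanging packets at the TCP level while one end is busy checksumming. Whether that would satisfy an application-level timeout depends on how that timeout is measured.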
Jan-Benedict Glaw
2005-Aug-18 11:03 UTC
Question and feature requests for processor bound systems
On Thu, 2005-08-18 04:45:21 -0500, Evan Harris <eharris@puremagic.com> wrote:
> Is there any way to disable the checksum block search in rsync, or to
> somehow optimize it for systems that are processor-bound in addition to
> being network bound?

By design, rsync trades CPU power for bandwidth.

> Option 4: Add an option, maybe --optimize-cpu, or --weak-checksums that
> would tell rsync to only use weak checksums up until the point in the file
> where the weak checksums first differ, and then fallback to the normal weak
> and strong checksums from there on. This is a hint that most likely the
> file is appended to, but will still catch most occurrences where a file was
> modified.

Option 4: tar over netcat.

Regards, JBG
Wayne Davison
2005-Aug-18 18:34 UTC
Question and feature requests for processor bound systems
On Thu, Aug 18, 2005 at 04:45:21AM -0500, Evan Harris wrote:
> Is there any way to disable the checksum block search in rsync, or to
> somehow optimize it for systems that are processor-bound in addition
> to being network bound?

The --whole-file option (-W) disables the rsync algorithm entirely, but not the full-file checksum to verify that the file was transferred correctly.

> Option 1: Add an option, maybe --optimize-append, that would optimize
> the checksum search by telling it that it can assume that files are
> probably just appended to, like logfiles.

The CVS source (also available in the "nightly" tar files) has the --append option that only transfers files that have gotten longer (or are new), starting the transfer after all the existing data. That should save some checksum processing, but rsync still includes all the old data in the full-file checksum that verifies that the file was sent correctly.

..wayne..
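
The behaviour described for --append can be sketched in Python as follows. This is an illustration of the idea only, not rsync's implementation: the function name `append_transfer` is hypothetical, SHA-256 stands in for rsync's full-file checksum, and "sender" and "receiver" are simulated as two local files.

```python
import hashlib
import os


def append_transfer(src_path, dst_path, chunk_size=1 << 16):
    """Append to dst only the bytes of src beyond dst's current length.

    Returns None if dst exists and src is not longer (file skipped),
    otherwise a tuple (bytes_sent, checksums_match).
    """
    exists = os.path.exists(dst_path)
    dst_len = os.path.getsize(dst_path) if exists else 0
    if exists and os.path.getsize(src_path) <= dst_len:
        return None

    sender = hashlib.sha256()
    receiver = hashlib.sha256()

    # Receiver side: the old data still goes into the full-file checksum.
    if exists:
        with open(dst_path, "rb") as old:
            for block in iter(lambda: old.read(chunk_size), b""):
                receiver.update(block)

    sent = 0
    offset = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        for block in iter(lambda: src.read(chunk_size), b""):
            sender.update(block)  # the sender hashes every byte it reads
            # Only the part of this block past the existing data is sent.
            tail = block[max(dst_len - offset, 0):]
            offset += len(block)
            if tail:
                dst.write(tail)
                receiver.update(tail)
                sent += len(tail)

    return sent, sender.digest() == receiver.digest()
```

The sketch makes the trade-off visible: very little data crosses the wire, but both sides still read and hash the entire file, so the CPU and disk cost of the full-file checksum remains.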