thr3ads.net - rsync - rsync is slowing down [Apr 2004]

Phil Howard

2004-Apr-03 19:23 UTC

rsync is slowing down

The cause is, of course, that the tree being synchronized is getting larger,
so of course rsync is slowing down.  But in the case of my particular file
tree, there is a way it could be sped up, though this would obviously
need a change in the rsync protocol to accomplish it.  Any tree that has
major unchanged subbranches would benefit from this.

The file tree I'm synchronizing in this case has archived data that is being
deposited under a YYYY/MM/DD directory structure.  Hundreds to thousands of
files are added each day, and I'm even considering breaking it down further
by hour.  In theory, I could do the synchronizing by date.  On the receiving
side, which is also the initiating side, I could have it record the last
date on which it fully completed synchronizing, and re-run that date as well
as any subsequent dates, rather than the entire tree (which could easily
reach a quarter million files or more per year).
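
As a rough sketch of the date-based approach just described (not something
posted in the thread): the hypothetical dates_since helper below prints the
YYYY/MM/DD path for every day from a given start date onward.  It assumes
GNU date for the -d option; the function name and its arguments are made up
for illustration.

```sh
#!/bin/sh
# dates_since START [END]
# Print YYYY/MM/DD for each day from START through END (default: today).
# START and END are ISO dates such as 2004-03-30.  Assumes GNU date (-d).
# The ( ... ) body runs in a subshell, so "day" and "end" stay local.
dates_since() (
    day=$1
    end=$(date -u -d "${2:-today}" +%Y-%m-%d)
    while [ "$(date -u -d "$day" +%s)" -le "$(date -u -d "$end" +%s)" ]; do
        date -u -d "$day" +%Y/%m/%d
        day=$(date -u -d "$day + 1 day" +%Y-%m-%d)
    done
)
```

The receiving side could loop over that output, starting from the stored
date of its last complete run, and invoke rsync once per day directory
(for example from remotehost:/starting/path/$d/ to /to/$d/).  By itself
this does not notice changes made under older dates.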

But I have one catch.  Occasionally, an older file is updated.  And that
older file needs to be synchronized as well.  The receiving/initiating side
won't know whether an old file is, or is not, updated.

What I think would be an improvement in rsync speed in this scenario, and
some similar ones where lots of tree branches are not updated for extended
periods of time, is to collect the timestamps (and checksums if that is
enabled) for each entire branch, hash them, and transfer only the hash of
that metadata.  It would need to be a strong hash like MD5 or SHA1 since
if the hashes are equal, the tree branch would be skipped, and none of the
filenames within would be transferred.  For branches where the hash is not
equal, then the same hashing would be done recursively on the sub-branches
until either unchanged sub-branches are found and skipped, or changed files
are found (and transferred).
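
The recursive comparison described above can be sketched outside of rsync
to show the idea.  This is only an illustration under stated assumptions,
not a protocol change: both trees are local directories standing in for the
two ends of a connection, GNU find (-printf) and sha1sum are available, and
the names meta_hash and changed_branches are hypothetical.

```sh
#!/bin/sh
# meta_hash DIR [FIND-OPTIONS]
# Print a SHA1 digest over (relative path, mtime, size) of the
# non-directories under DIR.  Assumes GNU find and sha1sum.
meta_hash() {
    dir=$1; shift
    (cd "$dir" 2>/dev/null &&
        find . "$@" ! -type d -printf '%P\t%T@\t%s\n' | LC_ALL=C sort) |
        sha1sum | cut -d' ' -f1
}

# changed_branches SRC DST [REL]
# Print each directory (relative to SRC) whose own files differ in
# metadata between SRC and DST.  A branch whose digest matches is skipped
# without looking at anything inside it.
changed_branches() (
    src=$1 dst=$2 rel=${3:-.}
    # Same digest for the whole branch: nothing below it needs listing.
    [ "$(meta_hash "$src/$rel")" = "$(meta_hash "$dst/$rel")" ] && exit 0
    # Files directly in this directory differ: report the directory.
    [ "$(meta_hash "$src/$rel" -maxdepth 1)" = \
      "$(meta_hash "$dst/$rel" -maxdepth 1)" ] || echo "$rel"
    # Repeat the comparison on each sub-branch.
    for d in "$src/$rel"/*/; do
        [ -d "$d" ] || continue
        sub=${d#"$src/"}; sub=${sub%/}
        changed_branches "$src" "$dst" "$sub"
    done
)
```

With a YYYY/MM/DD layout, an update to one old file would cost one digest
per sibling at each of the three levels rather than a listing of every file
in the tree.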

The catch with this mechanism is that nothing would be exchanged between
the rsync processes until the entire tree had been scanned and all the
timestamps collected (worse if doing checksums).  On networks that are slow
relative to the total volume of data to be synchronized, though, this could
still be a major time savings, as well as traffic savings.  But it clearly
would have to have a special option to enable it.

Has anything like this been considered before?

-- 
-----------------------------------------------------------------------------
| Phil Howard KA9WGN       | http://linuxhomepage.com/      http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/   http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------

Wayne Davison

2004-Apr-03 20:24 UTC

rsync is slowing down

You can implement such optimizations on top of rsync using either
excludes or the --files-from option.  For instance, if the sending
side maintained an exclude file of old directories that didn't need
to be transferred, you could write a script that would look for
updated items and remove the appropriate exclusion.  An exclude list
would have to be grabbed first from the remote side before it could
be used, though.
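
One possible shape for the exclude-file idea, as a sketch only: instead of
editing an existing exclude file, the hypothetical build_excludes function
below regenerates it on the sending side, listing every YYYY/MM/DD
directory that holds nothing newer than a stamp file.  The function name,
the stamp file, and the assumption that the stamp marks the last completed
sync are all illustrative.

```sh
#!/bin/sh
# build_excludes ROOT STAMP
# Print an anchored exclude pattern ("/YYYY/MM/DD/") for every day
# directory under ROOT containing no file newer than STAMP.
# STAMP must be an absolute path because the function changes directory.
build_excludes() (
    cd "$1" || exit 1
    for d in [0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]; do
        [ -d "$d" ] || continue
        # find prints nothing when the day directory has no newer files.
        [ -n "$(find "$d" -newer "$2" ! -type d | head -n 1)" ] ||
            echo "/$d/"
    done
)
```

The receiving side would first copy the generated list from the remote
host and then pass it to rsync with --exclude-from, so that only the day
directories left out of the list are scanned.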

Using --files-from lets you use a remote list of items directly.
If you had a shell script like this:

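#!/bin/sh
# List everything changed since the previous run, for use with --files-from.
cd /starting/path || exit 1
# Stamp the start of this scan; it becomes the reference for the next run.
touch /some/touchfile.new
for x in *; do
    case $x in
    [0-9][0-9][0-9][0-9])
	# Year directory: list only non-directories newer than the last run.
	find "$x" -newer /some/touchfile ! -type d
	;;
    *)
	# Anything else at the top level is always listed.
	echo "$x"
	;;
    esac
done >/some/files-from.new
mv /some/files-from.new /some/files-from
mv /some/touchfile.new /some/touchfile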

You could run an rsync command like this:

rsync -ar --files-from=:/some/files-from remotehost:/starting/path /to

The above script uses a touchfile separate from the files-from file, and
creates it before the scan starts, so that a file modified while the find
is running but missed by this run is picked up by the next one.  The find
skips directories because rsync is being run with the -r option, and a
listed directory would be transferred recursively.  You'd have to
pre-create /some/touchfile before the first run.

You could, of course, construct your own script that implemented a
more complex update-check that was more like the one you suggested.
You might also need a more complex dir-scan algorithm, but that's
totally in your control.

..wayne..
