The cause is, of course, that the tree being synchronized is getting larger,
so rsync is slowing down. But in the case of my particular file tree, there
is a way it could be sped up, though this would obviously need a change in
the rsync protocol to accomplish it. Any tree that has major unchanged
sub-branches would benefit from this.
The file tree I'm synchronizing in this case has archived data that is being
deposited under a YYYY/MM/DD directory structure. Hundreds to thousands of
files are added each day, and I'm even considering breaking it down further
by hour. In theory, I could do the synchronizing by date. On the receiving
side, which is also the initiating side, I could have it record the last
date on which it fully completed synchronizing, and re-run that date as well
as any subsequent dates, rather than the entire tree (which could easily
reach a quarter million files or more per year).
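
As a rough sketch of the date-based approach (Python; the function name
dirs_to_sync and the idea of storing the last complete date somewhere are
made up for illustration, not anything rsync provides), the receiving side
could work out which YYYY/MM/DD directories to re-run like this:

```python
from datetime import date, timedelta


def dirs_to_sync(last_complete, today):
    """List YYYY/MM/DD paths from the last fully synced date through today.

    The last complete date is included on purpose, since files may have
    been added to it after that run finished.
    """
    days = (today - last_complete).days
    return [
        (last_complete + timedelta(days=i)).strftime("%Y/%m/%d")
        for i in range(days + 1)
    ]


if __name__ == "__main__":
    for path in dirs_to_sync(date(2003, 2, 27), date(2003, 3, 1)):
        print(path)
```

Each of these paths could then be handed to its own rsync run instead of
the whole tree.
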
But I have one catch. Occasionally, an older file is updated, and that
older file needs to be synchronized as well. The receiving/initiating side
won't know whether an old file has or has not been updated.
What I think would be an improvement in rsync speed in this scenario, and
some similar ones where lots of tree branches are not updated for extended
periods of time, is to collect the timestamps (and checksums if that is
enabled) for each entire branch, hash them, and transfer only the hash of
that metadata. It would need to be a strong hash like MD5 or SHA1 since
if the hashes are equal, the tree branch would be skipped, and none of the
filenames within would be transferred. For branches where the hash is not
equal, then the same hashing would be done recursively on the sub-branches
until either unchanged sub-branches are found and skipped, or changed files
are found (and transferred).
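
To make the idea concrete, here is a minimal sketch in Python. It is not
how rsync works today and not a protocol design: both trees are read from
the local filesystem just to show the comparison logic, whereas a real
implementation would exchange the hashes over the wire. The function names
(file_digest, branch_digest, changed_files) are invented for illustration,
the metadata hashed is only mtime and size, and files that exist only on
the destination are ignored.

```python
import hashlib
import os


def file_digest(path):
    """Hash one file's metadata (mtime and size), not its contents."""
    st = os.stat(path)
    return hashlib.sha1(f"{st.st_mtime_ns}:{st.st_size}".encode()).hexdigest()


def branch_digest(path):
    """Hash a whole branch: each entry's name plus its own digest."""
    h = hashlib.sha1()
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        child = branch_digest(full) if os.path.isdir(full) else file_digest(full)
        h.update(f"{name}\0{child}\n".encode())
    return h.hexdigest()


def changed_files(src, dst):
    """Return relative paths of files under src that differ from dst."""
    changed = []

    def walk(rel):
        s, d = os.path.join(src, rel), os.path.join(dst, rel)
        if os.path.isdir(s):
            # Equal branch hashes: skip the branch without listing its files.
            # (Digests are recomputed at each level here; a real
            # implementation would compute them once and cache them.)
            if os.path.isdir(d) and branch_digest(s) == branch_digest(d):
                return
            for name in sorted(os.listdir(s)):
                walk(os.path.join(rel, name))
        elif not os.path.isfile(d) or file_digest(s) != file_digest(d):
            changed.append(rel)

    walk("")
    return changed
```

With a YYYY/MM/DD layout, an untouched year or month is dismissed with a
single hash comparison, and only the branches containing a changed file
are descended into.
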
The catch with this mechanism is that nothing would be exchanged between
the rsync processes until the entire tree had been scanned and all the
timestamps collected (worse if doing checksums). On slow (relative to the
total volume of data to be synchronized) networks, though, this could still
be a major time savings, as well as traffic savings. But it clearly would
have to have a special option to enable it.
Has anything like this been considered before?
--
-----------------------------------------------------------------------------
| Phil Howard KA9WGN | http://linuxhomepage.com/ http://ham.org/ |
| (first name) at ipal.net | http://phil.ipal.org/ http://ka9wgn.ham.org/ |
-----------------------------------------------------------------------------