Hi,
I recently came across a situation where "rsync --inplace" performs
very poorly. If both the source and destination files contain long sequences of
identical blocks, but not necessarily in the same location, the sender can spend
an inordinate amount of CPU time finding matching blocks.
In my case, I came across this problem while backing up multi-hundred-gigabyte
MySQL database files. There could be periods of *hours* where the sender was not
reading anything from disk or writing anything over the network.
Please consider the attached patch. It alleviates this problem by discarding
hash table entries on the sender that can't possibly be used during an
--inplace update.
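To illustrate the idea, here is a toy model of one hash chain. It is only a sketch and not the code in match.c. It assumes that, during an --inplace update, a destination block is usable only if its offset is at or beyond the sender's current offset, and it ignores the checksum comparison and the special case of a block sitting at the same offset in both files. The names `count_chain_visits` and `prune` are made up for this example.

```python
def count_chain_visits(chain_offsets, sender_offsets, prune):
    """Count hash-chain entries examined while the sender scans forward.

    chain_offsets:  destination-file offsets of blocks sharing one checksum
    sender_offsets: increasing offsets at which the sender looks up that checksum
    prune:          if True, drop entries that are already behind the sender
    """
    chain = list(chain_offsets)
    visits = 0
    for pos in sender_offsets:
        kept = []
        for i, block_offset in enumerate(chain):
            visits += 1
            if block_offset >= pos:
                # Usable in place: this block has not been overwritten yet.
                # Treat it as a match and stop walking the chain.
                kept.extend(chain[i:])
                break
            if not prune:
                # Unpruned behaviour: skip the entry but leave it in the chain,
                # so it is examined again at the next sender offset.
                kept.append(block_offset)
        chain = kept
    return visits


# Eight identical blocks at offsets 0..28, all behind a sender scanning 32..47.
blocks = range(0, 32, 4)
scan = range(32, 48)
print(count_chain_visits(blocks, scan, prune=False))  # 128
print(count_chain_visits(blocks, scan, prune=True))   # 8
```

Because the sender's offset only ever increases, an entry that is behind the sender once stays behind it, so dropping it cannot lose a match in this model. Without pruning the work grows as (chain length) x (number of lookups); with pruning each dead entry is visited once.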
I've included some test results below demonstrating the problem and the
improvement this patch provides.
Regards,
Michael
------------------------------------------------------------
All tests were conducted with files generated as follows:
$ perl -e 'print "\x00" x (10*1024*1024); print "\xff" x (10*1024*1024)' >a
$ perl -e 'print "\xff" x (10*1024*1024); print "\x00" x (10*1024*1024)' >b
$ md5sum a b
07c84be14041575befb779ca6dee16ab a
fc8fc3324a22639ff61d063e28385962 b
In other words, "a" contains 10 MiB of 0x00 bytes followed by 10 MiB of
0xff bytes, while "b" contains 10 MiB of 0xff bytes followed by 10 MiB of
0x00 bytes.
Prior to each test, the filesystem was synced and the kernel page cache cleared
(echo 3 >/proc/sys/vm/drop_caches).
Running "rsync --inplace" between these takes over a minute and a half, almost
all of it user CPU time:
$ time ./rsync-unpatched -vv --no-whole-file --checksum --inplace a b
delta-transmission enabled
a
total: matches=2291 hash_hits=10483476 false_alarms=0 data=10487904
sent 10,499,720 bytes received 27,564 bytes 113,808.48 bytes/sec
total size is 20,971,520 speedup is 1.99
real 1m32.170s
user 1m31.756s
sys 0m0.137s
$ md5sum a b
07c84be14041575befb779ca6dee16ab a
07c84be14041575befb779ca6dee16ab b
With the patch, the same transfer completes in well under a second:
$ time ./rsync-patched -vv --no-whole-file --checksum --inplace a b
delta-transmission enabled
a
total: matches=2291 hash_hits=2292 false_alarms=0 data=10487904
sent 10,499,720 bytes received 27,564 bytes 7,018,189.33 bytes/sec
total size is 20,971,520 speedup is 1.99
real 0m0.628s
user 0m0.389s
sys 0m0.043s
$ md5sum a b
07c84be14041575befb779ca6dee16ab a
07c84be14041575befb779ca6dee16ab b
The behaviour of rsync without --inplace is hardly affected. Unpatched:
$ time ./rsync-unpatched -vv --no-whole-file --checksum a b
delta-transmission enabled
a
total: matches=4582 hash_hits=4582 false_alarms=0 data=4288
sent 22,716 bytes received 27,564 bytes 33,520.00 bytes/sec
total size is 20,971,520 speedup is 417.09
real 0m0.579s
user 0m0.297s
sys 0m0.053s
$ md5sum a b
07c84be14041575befb779ca6dee16ab a
07c84be14041575befb779ca6dee16ab b
Patched:
$ time ./rsync-patched -vv --no-whole-file --checksum a b
delta-transmission enabled
a
total: matches=4582 hash_hits=4582 false_alarms=0 data=4288
sent 22,715 bytes received 27,564 bytes 33,519.33 bytes/sec
total size is 20,971,520 speedup is 417.10
real 0m0.598s
user 0m0.314s
sys 0m0.043s
$ md5sum a b
07c84be14041575befb779ca6dee16ab a
07c84be14041575befb779ca6dee16ab b
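The statistics in the four runs above can be cross-checked with a little arithmetic. The sketch below assumes a block size of 4576 bytes; that value is inferred from the reported numbers and is not printed in the output. The hash_hits formula is only an interpretation that happens to reproduce the unpatched figure: one hit per matched block, plus one hit for every window position lying entirely within the 0xff half of "a". The function names are made up for this example.

```python
FILE_SIZE = 2 * 10 * 1024 * 1024  # 20,971,520 bytes, the "total size" rsync reports
HALF = FILE_SIZE // 2             # each file is two 10 MiB runs
BLOCK_SIZE = 4576                 # assumption: inferred from the output, not shown in it


def literal_data(matches):
    """Bytes sent as literal data if every match covers one full block."""
    return FILE_SIZE - matches * BLOCK_SIZE


def speedup(sent, received):
    """Total size divided by bytes transferred, rounded as in rsync's summary."""
    return round(FILE_SIZE / (sent + received), 2)


def unpatched_inplace_hash_hits(matches):
    """Interpretation: matched blocks plus every window inside the 0xff half."""
    return matches + (HALF - BLOCK_SIZE + 1)


print(literal_data(2291))                 # 10487904 (--inplace runs)
print(literal_data(4582))                 # 4288 (runs without --inplace)
print(speedup(10_499_720, 27_564))        # 1.99
print(speedup(22_716, 27_564))            # 417.09
print(unpatched_inplace_hash_hits(2291))  # 10483476
```

Under the same interpretation, the patched figure of hash_hits=2292 would correspond to the 2291 matches plus a single lookup that walks the stale chain once and discards it.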
------------------------------------------------------------
Michael Chapman (1):
Discard unusable hash table entries
match.c | 25 ++++++++++++++++---------
1 file changed, 16 insertions(+), 9 deletions(-)
--
1.8.3.1