Thomas Knauth
2014-Apr-11 11:35 UTC
rsync performance on large files strongly depends on file's (dis)similarity
Hi list,

I've found this post on rsync's expected performance for large files:

https://lists.samba.org/archive/rsync/2007-January/017033.html

I have a related but different observation to share: with files in the multi-gigabyte range, I've noticed that rsync's runtime also depends on how much the source and destination diverge, i.e., synchronization is faster if the files are similar. However, this is not just because less data must be transferred.

For example, on an 8 GiB file with 10% updates, rsync takes 390 seconds. With 50% updates, it takes about 1400 seconds, and at 90% updates about 2400 seconds.

My current explanation, and it would be awesome if someone more knowledgeable than me could confirm it, is this: with very large files, we'd expect a certain level of false alarms, i.e., cases where the weak checksum matches but the strong checksum does not. With large files that are very similar, a weak match is much more likely to be confirmed by a matching strong checksum. Conversely, with large files that are very dissimilar, a weak match is much less likely to be confirmed by the strong checksum, exactly because the files are very different from each other. rsync ends up computing lots of strong checksums that do not result in a match.

Is this a valid/reasonable explanation? Can someone else confirm this relationship between rsync's computational overhead and the files' (dis)similarity?

Thanks,
Thomas.
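To make the two-level checksum scheme referred to above concrete, here is a minimal Python sketch (not rsync's actual C implementation) of the receiver-side signature: one cheap weak checksum and one expensive strong checksum per block of the receiver's copy. The block size, the function names and the use of MD5 are illustrative assumptions rather than rsync's exact parameters.

import hashlib

BLOCK_SIZE = 4096  # illustrative; rsync derives the block size from the file size

def weak_checksum(block: bytes) -> int:
    # Adler-32-style checksum, shown in its simple (non-rolling) form.
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def strong_checksum(block: bytes) -> bytes:
    # Expensive confirmation hash; MD5 stands in for rsync's strong checksum.
    return hashlib.md5(block).digest()

def build_signature(dst_data: bytes) -> dict:
    # Receiver side: weak and strong checksum for every block of its own copy,
    # indexed by the weak checksum so the sender can look up candidates cheaply.
    sig = {}
    for off in range(0, len(dst_data), BLOCK_SIZE):
        block = dst_data[off:off + BLOCK_SIZE]
        sig.setdefault(weak_checksum(block), []).append((strong_checksum(block), off))
    return sig

The point of the hypothesis is the second level: the strong checksum is only computed when a weak match suggests a candidate block, so the number of strong-checksum computations depends on how often weak matches occur and how often they are confirmed.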
Thomas Knauth
2014-Apr-11 13:09 UTC
rsync performance on large files strongly depends on file's (dis)similarity
Maybe an alternative explanation is that a high degree of similarity allows the sender to skip more bytes. For each matched block, the sender does not need to compute any checksums, weak or strong, for the next S bytes, where S is the block size. As the number of matched blocks decreases, i.e., as dissimilarity increases, the number of computed checksums grows. This relationship is especially apparent for large files, where many strong (and expensive) checksums must be computed due to the many false alarms.
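Building on the hypothetical signature sketch after the first message, the following simplified sender-side loop makes the skipping behaviour explicit: a confirmed block match lets the window jump ahead by a whole block, while a miss advances it by a single byte, so the checksum work grows as the files diverge.

def count_checksum_work(src_data: bytes, sig: dict) -> dict:
    # Sender side: slide a window over src_data, using the signature built by
    # build_signature() above. Real rsync updates the weak checksum in O(1)
    # per byte (rolling); recomputing it here keeps the sketch short.
    stats = {"weak": 0, "strong": 0, "matches": 0, "false_alarms": 0}
    pos = 0
    while pos + BLOCK_SIZE <= len(src_data):
        block = src_data[pos:pos + BLOCK_SIZE]
        stats["weak"] += 1
        candidates = sig.get(weak_checksum(block))
        if candidates:
            stats["strong"] += 1
            strong = strong_checksum(block)
            if any(strong == s for s, _ in candidates):
                stats["matches"] += 1
                pos += BLOCK_SIZE  # matched: skip the whole block, no checksums for S bytes
                continue
            stats["false_alarms"] += 1  # weak matched, strong did not
        pos += 1  # no confirmed match: advance a single byte
    return stats

Fed a signature built from similar data, the loop advances mostly in block-sized jumps; fed dissimilar data, it degenerates into a byte-by-byte scan, which is the growth in checksum work described above.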