Thomas Knauth
2014-Apr-11 11:35 UTC
rsync performance on large files strongly depends on file's (dis)similarity
Hi list,

I've found this post on rsync's expected performance for large files:

https://lists.samba.org/archive/rsync/2007-January/017033.html

I have a related but different observation to share: with files in the multi-gigabyte range, I've noticed that rsync's runtime also depends on how much the source and destination diverge, i.e., synchronization is faster if the files are similar. However, this is not just because less data must be transferred.

For example, on an 8 GiB file with 10% updates, rsync takes 390 seconds. With 50% updates, it takes about 1400 seconds, and at 90% updates about 2400 seconds.

My current explanation, and it would be awesome if someone more knowledgeable than me could confirm it, is this: with very large files, we'd expect a certain level of false alarms, i.e., cases where the weak checksum matches but the strong checksum does not. With large files that are very similar, a weak match is much more likely to be confirmed by a matching strong checksum. Conversely, with large files that are very dissimilar, a weak match is much less likely to be confirmed by the strong checksum, exactly because the files are very different from each other. rsync ends up computing lots of strong checksums that do not result in a match.

Is this a valid/reasonable explanation? Can someone else confirm this relationship between rsync's computational overhead and the files' (dis)similarity?

Thanks,
Thomas.
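To make the two-level checksum scheme referred to above concrete, here is a minimal Python sketch (not rsync's actual C implementation) of the receiver-side signature: one cheap weak checksum and one expensive strong checksum per block of the receiver's copy. The block size, the function names and the use of MD5 are illustrative assumptions rather than rsync's exact parameters.

import hashlib

BLOCK_SIZE = 4096  # illustrative; rsync derives the block size from the file size

def weak_checksum(block: bytes) -> int:
    # Adler-32-style checksum, shown in its simple (non-rolling) form.
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def strong_checksum(block: bytes) -> bytes:
    # Expensive confirmation hash; MD5 stands in for rsync's strong checksum.
    return hashlib.md5(block).digest()

def build_signature(dst_data: bytes) -> dict:
    # Receiver side: weak and strong checksum for every block of its own copy,
    # indexed by the weak checksum so the sender can look up candidates cheaply.
    sig = {}
    for off in range(0, len(dst_data), BLOCK_SIZE):
        block = dst_data[off:off + BLOCK_SIZE]
        sig.setdefault(weak_checksum(block), []).append((strong_checksum(block), off))
    return sig

The point of the hypothesis is the second level: the strong checksum is only computed when a weak match suggests a candidate block, so the number of strong-checksum computations depends on how often weak matches occur and how often they are confirmed.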
Thomas Knauth
2014-Apr-11 13:09 UTC
rsync performance on large files strongly depends on file's (dis)similarity
Maybe an alternative explanation is that a high degree of similarity allows the sender to skip more bytes. For each matched block, the sender does not need to compute any checksums, weak or strong, for the next S bytes, where S is the block size. As the number of matched blocks decreases, i.e., as dissimilarity increases, the number of computed checksums grows. This relationship is especially apparent for large files, where many strong (and expensive) checksums must be computed due to the many false alarms.
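Building on the hypothetical signature sketch after the first message, the following simplified sender-side loop makes the skipping behaviour explicit: a confirmed block match lets the window jump ahead by a whole block, while a miss advances it by a single byte, so the checksum work grows as the files diverge.

def count_checksum_work(src_data: bytes, sig: dict) -> dict:
    # Sender side: slide a window over src_data, using the signature built by
    # build_signature() above. Real rsync updates the weak checksum in O(1)
    # per byte (rolling); recomputing it here keeps the sketch short.
    stats = {"weak": 0, "strong": 0, "matches": 0, "false_alarms": 0}
    pos = 0
    while pos + BLOCK_SIZE <= len(src_data):
        block = src_data[pos:pos + BLOCK_SIZE]
        stats["weak"] += 1
        candidates = sig.get(weak_checksum(block))
        if candidates:
            stats["strong"] += 1
            strong = strong_checksum(block)
            if any(strong == s for s, _ in candidates):
                stats["matches"] += 1
                pos += BLOCK_SIZE  # matched: skip the whole block, no checksums for S bytes
                continue
            stats["false_alarms"] += 1  # weak matched, strong did not
        pos += 1  # no confirmed match: advance a single byte
    return stats

Fed a signature built from similar data, the loop advances mostly in block-sized jumps; fed dissimilar data, it degenerates into a byte-by-byte scan, which is the growth in checksum work described above.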