thr3ads.net - rsync - inefficient: --checksum calculation shouldn't be done for new files [Jul 2011]

Carlos Carvalho

2011-Jul-03 00:46 UTC

inefficient: --checksum calculation shouldn't be done for new files

When --checksum is used, checksums are calculated at both ends to see if
the file should be transferred. This is of course not necessary if the
file doesn't exist at the destination. However, the checksum is still
calculated by the sender, which is often a very large overhead.

Would it be possible to avoid it?

Jamie Lokier

2011-Jul-03 23:00 UTC

inefficient: --checksum calculation shouldn't be done for new files

Carlos Carvalho wrote:
> When --checksum is used they're calculated in both ends to see if the
> file should be transferred. This is of course not necessary if the file
> doesn't exist in the destination. However, the checksum is still
> calculated by the sender, which is often a very large overhead.
>
> Would it be possible to avoid it?

Doesn't the receiver use the checksum to verify it received the file
with no errors?

-- Jamie

Wayne Davison

2011-Jul-05 00:10 UTC

inefficient: --checksum calculation shouldn't be done for new files

On Sat, Jul 2, 2011 at 5:46 PM, Carlos Carvalho
<carlos at fisica.ufpr.br> wrote:
> When --checksum is used they're calculated in both ends to see if the file
> should be transferred. This is of course not necessary if the file doesn't
> exist in the destination. However, the checksum is still calculated by the
> sender, which is often a very large overhead.
>
> Would it be possible to avoid it?

To do so would involve adding an extra round-trip request to a transfer, so
it is feasible, but is not currently supported.

Such a feature would look like this:

Instead of the sender including checksum information for all files in the
file-list, it would send a checksum-less list and let the generator look
for files that already exist on the receiver. For each such file, the
generator would send a new request to the sender asking for that file's
checksum. While waiting for the reply, the generator would compute its own
checksum of the existing file. Once the sender's value arrived, it would
compare the two and either request a file transfer or handle the file as
up-to-date.
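
The proposed flow can be sketched in a few lines of Python. This is only an
in-memory simulation of the decision logic, not rsync code: the function
names are hypothetical, files are modelled as a dict of name to bytes, and
MD5 is used merely as a stand-in whole-file checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    # Stand-in whole-file checksum; the real algorithm is whatever rsync negotiates.
    return hashlib.md5(data).hexdigest()


def plan_transfer(sender_files: dict, receiver_files: dict):
    """Simulate the checksum-less file list described above.

    Returns (to_transfer, up_to_date, sender_checksums_computed).
    """
    to_transfer = []
    up_to_date = []
    sender_checksums_computed = 0

    # The sender's list carries names only, no checksums.
    for name in sorted(sender_files):
        if name not in receiver_files:
            # New file: nothing to compare against, so no checksum on either side.
            to_transfer.append(name)
            continue

        # The file exists on the receiver, so the generator would ask the
        # sender for its checksum (the extra round trip) and hash the local
        # copy while waiting for the answer.
        local = checksum(receiver_files[name])
        remote = checksum(sender_files[name])
        sender_checksums_computed += 1

        if local == remote:
            up_to_date.append(name)
        else:
            to_transfer.append(name)

    return to_transfer, up_to_date, sender_checksums_computed
```

In this model the sender hashes only files that already exist on the
receiver, which is the saving the original question asks for. The cost is
one extra request per existing file.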

That all sounds interesting, but would require a new --favor-missing-files
(or some such) option to tell rsync to use the alternate checksum method.
It would be interesting to try something like that and see how much time it
saves in checksum generation versus the time it consumes in round-trip lag.
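
A back-of-the-envelope way to frame that trade-off is sketched below. The
model is an assumption made for illustration, not a measurement: it
supposes the sender hashes at a fixed rate and that every existing file
costs one full, unpipelined round trip, which is the worst case for the
proposed method. The function and parameter names are hypothetical.

```python
def estimate_net_saving(new_bytes: float,
                        hash_bytes_per_sec: float,
                        existing_files: int,
                        rtt_sec: float) -> float:
    """Rough net time saved (seconds) by skipping checksums for new files.

    Assumes a constant hashing rate and one unpipelined round trip per
    file that already exists on the receiver.
    """
    # Time the sender no longer spends hashing files the receiver lacks.
    saved = new_bytes / hash_bytes_per_sec
    # Time spent waiting on the extra per-file checksum requests.
    lag = existing_files * rtt_sec
    return saved - lag
```

A positive result suggests the alternate method would win under these
assumptions; a negative one suggests the round-trip lag dominates, as it
might for many small existing files over a slow link.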

As for what is currently possible, see patches/db.diff,
patches/checksum-reading.diff, patches/checksum-updating.diff, and
(possibly) patches/checksum-xattrs.diff for examples of ways to make the
checksum sending more efficient. This presumes that the sender has
rarely-changing files, and that caching the checksums based on a more
stringent time method (down to the ctime in the case of the first 3
patches) is something that would help your use case. Of all those choices,
using db.diff (possibly with a sqlite DB) is a pretty nice solution that
could speed up checksum transfers by a huge amount while also avoiding
sprinkling around a bunch of checksum files (or xattrs).
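
To illustrate the general caching idea (not the actual schema or behaviour
of db.diff, which is not shown in this thread), here is a minimal Python
sketch. It assumes a hypothetical sqlite table keyed on path, and it treats
a cached checksum as valid only while the file's size, mtime, and ctime are
unchanged. MD5 is again just a stand-in.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path: str = ":memory:") -> sqlite3.Connection:
    # Hypothetical schema for illustration only.
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS sums ("
        "path TEXT PRIMARY KEY, size INTEGER, "
        "mtime_ns INTEGER, ctime_ns INTEGER, digest TEXT)"
    )
    return db


def cached_checksum(db: sqlite3.Connection, path: str):
    """Return (digest, was_cache_hit) for the file at path."""
    st = os.stat(path)
    key = (st.st_size, st.st_mtime_ns, st.st_ctime_ns)

    row = db.execute(
        "SELECT size, mtime_ns, ctime_ns, digest FROM sums WHERE path = ?",
        (path,),
    ).fetchone()
    if row is not None and tuple(row[:3]) == key:
        # Size, mtime, and ctime all match: reuse the stored checksum.
        return row[3], True

    # Cache miss or stale entry: hash the file in chunks and store the result.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    digest = h.hexdigest()
    db.execute(
        "INSERT OR REPLACE INTO sums VALUES (?, ?, ?, ?, ?)",
        (path, *key, digest),
    )
    db.commit()
    return digest, False
```

On a sender whose files rarely change, most lookups would be cache hits, so
the expensive read-and-hash pass happens only for files that were actually
modified.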

..wayne..