I thought rsync would calculate checksums of large files whose timestamps or file sizes have changed, and send only the chunks that changed. Is this not correct? My goal is to come up with a reasonable (fast and efficient) way to incrementally back up my Parallels virtual machine every day (a directory structure containing mostly small files, and one 20G file).

I'm on OS X 10.5, using rsync 2.6.9, and the destination machine has the same versions. I configured ssh keys, and this is my result:

(Initial sync)
time rsync -a --delete MyVirtualMachine/ myserver:MyVirtualMachine/
20G, ~30 minutes

(Second time I ran it, with no changes to the VM)
time rsync -a --delete MyVirtualMachine/ myserver:MyVirtualMachine/
2 seconds

(Then I made some minor changes inside the VM, and I want to send just the changed blocks)
time rsync -a --delete MyVirtualMachine/ myserver:MyVirtualMachine/
After waiting 50 minutes, I cancelled the job.

Why does it take longer the 3rd time I run it? Shouldn't the performance always be **at least** as good as the initial sync?

Thanks for any help.
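If it helps with diagnosis, my understanding is that rsync's --stats and --itemize-changes flags (both available in 2.6.9, as far as I can tell) should show whether the delta-transfer algorithm is actually kicking in on the 20G file, e.g.:

time rsync -a --delete --stats --itemize-changes MyVirtualMachine/ myserver:MyVirtualMachine/

If delta transfer is working, I'd expect the "Matched data" line in the summary to be large and the "Literal data" line to be small for a lightly changed image.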
On 04.09.2009 18:00, eharvey at lyricsemiconductors.com wrote:
> Why does it take longer the 3rd time I run it? Shouldn't the performance
> always be **at least** as good as the initial sync?

Not per se. First rsync has to determine THAT the file has changed; only then is the file synced, if there was a change. At least that's what it has to do when the file size is unchanged and only the timestamp differs (which is unfortunately often the case for virtual machine images). Worst case: it takes double the time if the change is at the end of the file. When the file size differs, rsync immediately knows that the file has actual changes and starts the sync right away.

If I understand '--ignore-times' correctly, it forces rsync to always regard the files as changed and so start a sync right away, without first checking for changes.

There are also some other options that may or may not have a speed impact for you:

--inplace, so that rsync doesn't create a temporary copy that is later moved over the previous file on the target side.

--whole-file, so that rsync doesn't use delta transfer but rather copies the whole file.

You may also want to separate the small files from the large one with --min-size and --max-size, so you can use different options for the small and large file(s); see the sketch below.

Until then
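A minimal, untested sketch of that split (the 500M threshold and the two-pass structure are just placeholders for your setup, and I've left --delete out since you'd want to check how it interacts with the size filters):

# pass 1: everything but the big disk image, with rsync's default behaviour
rsync -a --max-size=500M MyVirtualMachine/ myserver:MyVirtualMachine/

# pass 2: only the big image; update it in place and skip the quick check
rsync -a --inplace --ignore-times --min-size=500M MyVirtualMachine/ myserver:MyVirtualMachine/

One caveat with --inplace: the target file is updated destructively, so if the transfer is interrupted the copy on the server may be left in a half-updated state until the next run completes.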
eharvey at lyricsemiconductors.com wrote:
> I thought rsync would calculate checksums of large files whose
> timestamps or file sizes have changed, and send only the chunks that
> changed. Is this not correct? My goal is to come up with a reasonable
> (fast and efficient) way to incrementally back up my Parallels virtual
> machine every day (a directory structure containing mostly small files,
> and one 20G file).
>
> I'm on OS X 10.5, using rsync 2.6.9, and the destination machine has
> the same versions. I configured ssh keys, and this is my result:

Upgrade to at least rsync 3. Rsync keeps a hash table of the block checksums that it matches against with a sliding window. In older versions of rsync, that hash table had a constant size, which meant that files over 3GB had a high chance of hash collisions. For a 20G file, the collisions alone might be the cause of your trouble. Newer rsyncs detect when the hash table gets too big and increase its size accordingly, thus avoiding the collisions.

In other words: upgrade both sides (but especially the sender).

Shachar

--
Shachar Shemesh
Lingnu Open Source Consulting Ltd.
http://www.lingnu.com
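For completeness, a quick way to confirm which rsync each side is actually running (this assumes rsync is on the server's default PATH):

rsync --version
ssh myserver rsync --version

If the newer rsync ends up in a non-standard location on the server, the client can be told where to find it with --rsync-path; the path below is only an example:

rsync -a --rsync-path=/usr/local/bin/rsync MyVirtualMachine/ myserver:MyVirtualMachine/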