thr3ads.net - rsync - Discussion about the detect-renamed patch [Oct 2007]

Charles Perreault

2007-Oct-31 05:07 UTC

Discussion about the detect-renamed patch

Hello,

I recently heard about the detect-renamed patch for rsync.  I was about 
to code something similar, since I need it badly, but decided to give the 
patch a try first.  It seems to work well, but its rename detection 
scheme seems to be limited.  From my tests, it seems to detect that a 
file has changed name only if the file remains in the same base dir, and 
that a file has moved from one dir to another only if it keeps the same 
name.

I was wondering who the author(s) of the patch are (it's not written 
in the patch file), as I'd like to discuss possible enhancements with 
them.

Thank you,

Charles Perreault

Matt McCutchen

2007-Nov-06 03:27 UTC

Discussion about the detect-renamed patch

On Tue, 2007-10-30 at 21:25 -0400, Charles Perreault wrote:
> I heard recently about the detect-renamed patch for rsync.  I was about 
> to code something similar, I need it badly, but decided to give the 
> patch a try first.  It seems to work well but it's rename detection 
> scheme seems to be limited.  From my tests, it seems to detect that a 
> file has changed name only if the file remains in the same base dir.  It 
> detects that a file has moved from one dir to another only if it keeps 
> the same name.
--detect-renamed is supposed to be able to detect a rename even when the
file both moves and changes basename, and in my test, it did so.  If you
have a reproducible case in which it doesn't, let's figure out what is
going on.
> I was wondering who is/are the author(s) of the patch (it's not written
> in the patch file) as I'd like to discuss with them about possible 
> enhancement.
It appears that the patch was written by Wayne Davison, the current
rsync maintainer.  Discussion about enhancements can occur on this list.

Matt

Matt McCutchen

2007-Nov-07 04:09 UTC

Discussion about the detect-renamed patch

On Tue, 2007-11-06 at 22:10 -0500, Charles Perreault wrote:
> now input the following to test moving into a new folder :
> 
> $ mkdir src/dir1
> $ mv src/file2 src/dir1/
> $ rsync --detect-renamed -avz --delete src/ dest/
> building file list ... done
> deleting file2
> ./
> dir1/
> dir1/file2
> 
> sent 24031 bytes  received 54 bytes  48170.00 bytes/sec
> total size is 35653  speedup is 1.48
If you pass three -v options, you'll see that rsync does in fact detect
the rename.  However, it does not accept the destination file as
identical to the source file in lieu of performing a transfer because
that would be risky (use the --detect-renamed-lax option provided by
detect-renamed-lax.diff if you really want this behavior).  Instead,
rsync just uses the destination file as a basis for the source file.  On
a remote copy, the delta-transfer algorithm uses the basis to decrease
the amount of network traffic, but on a local copy such as yours, the
delta-transfer algorithm is off by default (since decreasing
interprocess traffic is not an objective) so detection of a rename has
no effect on the number of bytes sent.
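
As an illustration of why a basis file reduces traffic, here is a minimal Python sketch of block matching against a basis.  It is not rsync's implementation: real rsync pairs a rolling weak checksum with a strong checksum, whereas this sketch naively hashes a window at every offset.  The names `BLOCK`, `block_table`, `delta`, and `literal_bytes` are hypothetical, and the 4-byte block size is chosen only to keep the example small.

```python
import hashlib

BLOCK = 4  # hypothetical, tiny block size for illustration only


def block_table(basis: bytes, block: int = BLOCK) -> dict:
    """Map the checksum of each full, aligned block of the basis to its offset."""
    table = {}
    for offset in range(0, len(basis) - block + 1, block):
        digest = hashlib.md5(basis[offset:offset + block]).digest()
        table.setdefault(digest, offset)
    return table


def delta(new: bytes, basis: bytes, block: int = BLOCK) -> list:
    """Describe `new` as copies of basis blocks plus literal bytes."""
    table = block_table(basis, block)
    ops = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        chunk = new[pos:pos + block]
        offset = None
        if len(chunk) == block:
            offset = table.get(hashlib.md5(chunk).digest())
        if offset is not None:
            # Flush pending literal data, then reference the basis block.
            if literal:
                ops.append(("literal", bytes(literal)))
                literal.clear()
            ops.append(("copy", offset))
            pos += block
        else:
            # No basis block matches here: this byte must be sent literally.
            literal.append(new[pos])
            pos += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def literal_bytes(ops: list) -> int:
    """Count the bytes that would have to cross the wire as literal data."""
    return sum(len(data) for kind, data in ops if kind == "literal")
```

With the renamed destination file as the basis, an unchanged file yields only copy operations and no literal data; with an empty basis (no rename detected), every byte is literal.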
> 4- on receiver, find a match in the tree if possible for each item of
> the new files list, and use the match as a base for the
> synchronization (a fuzzy / bayesian approach might be used later to
> find an approximate match, but that isn't my goal right now)
The current patch does the matching the other way around: for each
extraneous destination file D, it looks for a matching source file S and
uses D as a basis for S if S is new.  Your approach is more
straightforward, can detect copies, offers more flexibility in choosing
the best basis for each new source file, and would make it natural to
combine --detect-renamed and --fuzzy; it requires holding a complete
listing of the destination in memory, while the current one holds a
listing of the source.  I would be interested to see an implementation
of your approach.
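
The two matching directions can be sketched in Python as follows.  This is only an illustration under stated assumptions, not code from either patch: files are matched on (size, mtime), the criterion Charles mentions elsewhere in this thread, and the names `Entry`, `match_dest_to_source`, and `match_source_to_dest` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    path: str
    size: int
    mtime: int


def _key(entry: Entry) -> tuple:
    # Assumption: two files "match" when size and mtime are equal.
    return (entry.size, entry.mtime)


def match_dest_to_source(source: list, dest: list) -> dict:
    """For each extraneous destination file D, find a new source file S.

    Indexes the new source files, so it holds a listing of the source.
    """
    src_paths = {e.path for e in source}
    dest_paths = {e.path for e in dest}
    new_by_key = defaultdict(list)
    for s in source:
        if s.path not in dest_paths:
            new_by_key[_key(s)].append(s)
    basis = {}
    for d in dest:
        if d.path in src_paths:
            continue  # D is not extraneous
        for s in new_by_key.get(_key(d), []):
            if s.path not in basis:
                basis[s.path] = d.path
                break
    return basis


def match_source_to_dest(source: list, dest: list,
                         extraneous_only: bool = True) -> dict:
    """For each new source file S, look up a destination file to use as basis.

    With extraneous_only=False the whole destination is indexed, which also
    finds copies, at the cost of holding a complete destination listing.
    """
    src_paths = {e.path for e in source}
    dest_paths = {e.path for e in dest}
    dest_by_key = defaultdict(list)
    for d in dest:
        if extraneous_only and d.path in src_paths:
            continue
        dest_by_key[_key(d)].append(d)
    basis = {}
    for s in source:
        if s.path in dest_paths:
            continue  # S is not new
        candidates = dest_by_key.get(_key(s))
        if candidates:
            basis[s.path] = candidates[0].path
    return basis
```

Both functions return a mapping from each new source path to the destination path chosen as its basis.  Only the second direction, with `extraneous_only=False`, can pair a new source file with a destination file that still exists on the source, which is what detecting a copy requires.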
> 5- on receiver, locally copy the matched files where they should be if
> they existed
> 6- compute the checksums for modified files and for all the matched
> files copied at step 5 (mark them dirty/not up to date)
"Copying" the matched destination files and "mark[ing] them dirty"
corresponds to rsync's current behavior of hard-linking them into the
partial dirs for the matching source files.  It's not clear to me why
you are computing checksums, or are you just referring to the block
checksums computed by the generator for the delta-transfer algorithm?
> No information about current hierarchy of the files neither on sender
> or receiver is needed to build the list and the tree.
What do you mean?  Steps 1 and 2 scan the entire hierarchy on both
sides.

Matt

Matt McCutchen

2007-Nov-07 22:13 UTC

Discussion about the detect-renamed patch

On Wed, 2007-11-07 at 09:13 -0500, Charles Perreault wrote:
> From what I read about that lax patch, using it is risky.  In fact
> that's not what I want at all; I want the content to be checked using
> block checksums and the delta-transfer algorithm in case the detection
> made a mistake, like the detect-renamed patch does.  But I want, if
> possible, to find a match for every new file that the sender has and
> the receiver does not, in order to decrease the amount of network
> traffic, and there the current patch seems to be weak.
I know you want to be able to detect copies as well as renames.  Are you
saying anything more than that?
> But I also did the same test over network.  Here's a log with -vvv.
> This log shows --delete-after because I thought maybe something was
> wrong with the default behaviour (I tested --delete too) but that
> didn't change anything :
> 
> $ mkdir src/dir1
> $ mv src/file2 src/dir1/
> $ rsync -avvv --delete-after --detect-renamed src nas1:/home/user
> sent 25316 bytes  received 60 bytes  50752.00 bytes/sec
> total size is 35653  speedup is 1.40
> _exit_cleanup(code=0, file=main.c, line=977): about to call exit(0)
> 
> As you can see, the whole file was transferred again.
I cannot reproduce this.  For me, rsync 2.6.9 + its detect-renamed.diff
and the latest development rsync + its detect-renamed.diff both detect
the rename correctly.
> As to memory usage, it depends on what option you choose
> (detect-renamed or detect-copied), my patch could do both.  The former
> only searches for a match among extraneous files on the receiver
Good point: if turnover is low and only renames are being detected, your
approach (in combination with incremental recursion) may use much less
memory than the current one.
>  and the latter, indeed, needs to hold a complete listing of
> destination in memory.  If the current patch already needs a listing
> of the source, well my method would use about the same amount of
> memory on average
Yes, a complete listing of one side.
>  (in the order of O(n)).
Please don't describe the memory usage as O(n).  The constant factor in
front of the number of files is really important as it can be the
difference between a smooth run and death by swapping.
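
To make the point about the constant factor concrete, here is a back-of-the-envelope Python sketch.  The per-entry sizes below are hypothetical placeholders, not measured figures for any rsync version, and `file_list_memory_mib` is an illustrative helper.

```python
def file_list_memory_mib(num_files: int, bytes_per_entry: int) -> float:
    """Estimate the memory for an in-memory file list, in MiB."""
    return num_files * bytes_per_entry / (1024 * 1024)


# Both estimates are "O(n)", yet they differ by a factor of four.
# The per-entry costs (100 and 400 bytes) are assumptions for illustration.
for bytes_per_entry in (100, 400):
    mib = file_list_memory_mib(10_000_000, bytes_per_entry)
    print(f"{bytes_per_entry} bytes/entry -> {mib:.0f} MiB")
```

On a machine with limited RAM, a difference of that size can decide whether the list fits in memory or spills into swap.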
> > > No information about current hierarchy of the files neither on sender
> > > or receiver is needed to build the list and the tree.
> > 
> > What do you mean?  Steps 1 and 2 scan the entire hierarchy on both
> > sides.
> Yes it is scanned, but not used in the matching process.  Filenames
> and paths are only relevant if more than one match is found (size +
> mod time) to discriminate the best match, otherwise it's useless.
OK, so what's your point?  The filenames and paths have to be held in
memory so that the matching destination files can actually be accessed
when it comes time to use their data for delta transfers.
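
As a sketch of how paths might break ties when several destination files share the same size and mod time, consider the Python below.  The scoring rule (same basename first, then shared leading directories) is a hypothetical choice for illustration; the thread does not specify one.  It also shows why the candidate paths must be kept around: the result is a path that has to be opened later.

```python
import posixpath


def path_similarity(a: str, b: str) -> int:
    """Hypothetical score: same basename counts most, then shared leading dirs."""
    score = 0
    if posixpath.basename(a) == posixpath.basename(b):
        score += 100
    dirs_a = posixpath.dirname(a).split("/")
    dirs_b = posixpath.dirname(b).split("/")
    for x, y in zip(dirs_a, dirs_b):
        if x != y:
            break
        if x:  # ignore the empty component produced by a top-level file
            score += 1
    return score


def best_basis(new_path: str, candidates: list):
    """Pick the candidate path most similar to new_path, or None if there is none."""
    return max(candidates,
               key=lambda c: path_similarity(new_path, c),
               default=None)
```

For example, `best_basis("photos/2007/img1.jpg", ["backup/a.jpg", "photos/old/img1.jpg", "photos/2007/tmp.jpg"])` prefers `photos/old/img1.jpg`, because a matching basename outweighs a shared directory under this scoring.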

Matt
