Hello,

I recently heard about the detect-renamed patch for rsync. I was about to code something similar, since I need it badly, but decided to give the patch a try first. It seems to work well, but its rename detection scheme seems to be limited. From my tests, it detects that a file has changed name only if the file remains in the same base dir, and it detects that a file has moved from one dir to another only if it keeps the same name.

I was wondering who the author(s) of the patch are (it's not stated in the patch file), as I'd like to discuss possible enhancements with them.

Thank you,
Charles Perreault
On Tue, 2007-10-30 at 21:25 -0400, Charles Perreault wrote:
> I heard recently about the detect-renamed patch for rsync. I was about
> to code something similar, I need it badly, but decided to give the
> patch a try first. It seems to work well but it's rename detection
> scheme seems to be limited. From my tests, it seems to detect that a
> file has changed name only if the file remains in the same base dir. It
> detects that a file has moved from one dir to another only if it keeps
> the same name.

--detect-renamed is supposed to be able to detect a rename even when the file both moves and changes basename, and in my test, it did so. If you have a reproducible case in which it doesn't, let's figure out what is going on.

> I was wondering who is/are the author(s) of the patch (it's not written
> in the patch file) as I'd like to discuss with them about possible
> enhancement.

It appears that the patch was written by Wayne Davison, the current rsync maintainer. Discussion about enhancements can occur on this list.

Matt
On Tue, 2007-11-06 at 22:10 -0500, Charles Perreault wrote:
> now input the following to test moving into a new folder :
>
> $ mkdir src/dir1
> $ mv src/file2 src/dir1/
> $ rsync --detect-renamed -avz --delete src/ dest/
> building file list ... done
> deleting file2
> ./
> dir1/
> dir1/file2
>
> sent 24031 bytes received 54 bytes 48170.00 bytes/sec
> total size is 35653 speedup is 1.48

If you pass three -v options, you'll see that rsync does in fact detect the rename. However, it does not accept the destination file as identical to the source file in lieu of performing a transfer because that would be risky (use the --detect-renamed-lax option provided by detect-renamed-lax.diff if you really want this behavior). Instead, rsync just uses the destination file as a basis for the source file. On a remote copy, the delta-transfer algorithm uses the basis to decrease the amount of network traffic, but on a local copy such as yours, the delta-transfer algorithm is off by default (since decreasing interprocess traffic is not an objective) so detection of a rename has no effect on the number of bytes sent.

> 4- on receiver, find a match in the tree if possible for each item of
> the new files list, and use the match as a base for the
> synchronization (a fuzzy / bayesian approach might be used later to
> find an approximal good match, but that ain't my goal right now)

The current patch does the matching the other way around: for each extraneous destination file D, it looks for a matching source file S and uses D as a basis for S if S is new. Your approach is more straightforward, can detect copies, offers more flexibility in choosing the best basis for each new source file, and would make it natural to combine --detect-renamed and --fuzzy; it requires holding a complete listing of the destination in memory, while the current one holds a listing of the source.
I would be interested to see an implementation of your approach.

> 5- on receiver, locally copy the matched files where they should be if
> they existed
> 6- compute the checksums for modified files and for all the matched
> files copied at step 5 (mark them dirty/not up to date)

"Copying" the matched destination files and "mark[ing] them dirty" corresponds to rsync's current behavior of hard-linking them into the partial dirs for the matching source files. It's not clear to me why you are computing checksums, or are you just referring to the block checksums computed by the generator for the delta-transfer algorithm?

> No information about current hierarchy of the files neither on sender
> or receiver is needed to build the list and the tree.

What do you mean? Steps 1 and 2 scan the entire hierarchy on both sides.

Matt
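To make the two matching directions discussed above concrete, here is a minimal Python sketch. It is not rsync's code and not the patch's code: the `Entry` type, the function names, and the use of (size, mtime) as the match key for both directions are assumptions made for illustration. The first function walks the extraneous destination files and looks for a new source file to pair each with; the second walks the new source files and looks through the whole destination listing, which is why it can also find copies.

```python
from collections import defaultdict
from typing import NamedTuple


class Entry(NamedTuple):
    # Hypothetical file-list entry; real rsync file lists carry more fields.
    path: str
    size: int
    mtime: int


def index_by_key(entries):
    """Group entries by (size, mtime). The key is an assumption for this sketch."""
    index = defaultdict(list)
    for e in entries:
        index[(e.size, e.mtime)].append(e)
    return index


def match_from_extraneous(source, dest):
    """For each extraneous dest file D, find a new source file S; D becomes S's basis."""
    src_paths = {e.path for e in source}
    dest_paths = {e.path for e in dest}
    # Only source files that are missing on the destination are candidates.
    new_sources = index_by_key(e for e in source if e.path not in dest_paths)
    basis = {}
    for d in dest:
        if d.path in src_paths:
            continue  # D still exists on the source side, so it is not extraneous
        for s in new_sources.get((d.size, d.mtime), []):
            if s.path not in basis:
                basis[s.path] = d.path  # each D serves at most one S
                break
    return basis


def match_from_new(source, dest):
    """For each new source file S, look for any dest file D with the same key."""
    dest_paths = {e.path for e in dest}
    dest_index = index_by_key(dest)  # holds the complete destination listing
    basis = {}
    for s in source:
        if s.path in dest_paths:
            continue  # not a new file
        candidates = dest_index.get((s.size, s.mtime), [])
        if candidates:
            basis[s.path] = candidates[0].path  # naive choice: first candidate
    return basis


source = [Entry("file1", 100, 1), Entry("dir1/file2", 200, 2), Entry("copy/file1", 100, 1)]
dest = [Entry("file1", 100, 1), Entry("file2", 200, 2)]
renames_only = match_from_extraneous(source, dest)
renames_and_copies = match_from_new(source, dest)
```

In this toy listing, `dir1/file2` is a rename of `file2` and `copy/file1` is a copy of `file1`. Only the second function pairs the copy with a basis, because `file1` is not extraneous on the destination.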
On Wed, 2007-11-07 at 09:13 -0500, Charles Perreault wrote:
> For what I read about that lax patch, using it is risky. In fact
> that's not what I want at all, I want the content to be checked using
> block checksums and the delta-transfer algorithm if the detection did
> a mistake, like the detect-renamed patch does. But I want if possible
> to find a match for every new file the sender has and not the receiver
> in order to decrease the amount of network traffic, and there the
> current patch seems to be weak.

I know you want to be able to detect copies as well as renames. Are you saying anything more than that?

> But I also did the same test over network. Here's a log with -vvv.
> This log shows --delete-after because I thought maybe something was
> wrong with the default behaviour (I tested --delete too) but that
> didn't change anything :
>
> $ mkdir src/dir1
> $ mv src/file2 src/dir1/
> $ rsync -avvv --delete-after --detect-renamed src nas1:/home/user
> sent 25316 bytes received 60 bytes 50752.00 bytes/sec
> total size is 35653 speedup is 1.40
> _exit_cleanup(code=0, file=main.c, line=977): about to call exit(0)
>
> As you can see, the whole file was transferred again.

I cannot reproduce this. For me, rsync 2.6.9 + its detect-renamed.diff and the latest development rsync + its detect-renamed.diff both detect the rename correctly.

> As to memory usage, it depends on what option you choose
> (detect-renamed or detect-copied), my patch could do both. The former
> only search a match in extraneous files on the receiver

Good point: if turnover is low and only renames are being detected, your approach (in combination with incremental recursion) may use much less memory than the current one.

> and the later, indeed, needs to hold a complete listing of
> destination in memory. If the current patch already needs a listing
> of the source, well my method would use about the same amount of
> memory on average

Yes, a complete listing of one side.

> (in the order of O(n)).

Please don't describe the memory usage as O(n). The constant factor in front of the number of files is really important as it can be the difference between a smooth run and death by swapping.

> > > No information about current hierarchy of the files neither on sender
> > > or receiver is needed to build the list and the tree.
> >
> > What do you mean? Steps 1 and 2 scan the entire hierarchy on both
> > sides.
> Yes it is scanned, but not used in the matching process. Filenames
> and paths are only relevant if more than one match is found (size +
> mod time) to discriminate the best match, otherwise it's useless.

OK, so what's your point? The filenames and paths have to be held in memory so that the matching destination files can actually be accessed when it comes time to use their data for delta transfers.

Matt
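The tie-breaking idea in the last exchange (paths only matter when several candidates share the same size and modification time) could be sketched as follows. This is an illustration, not code from the patch or from the proposal: the scoring rule (prefer an identical basename, then the deepest shared leading directory) and the names `best_basis` and `shared_dir_depth` are hypothetical. It also shows Matt's point in passing: the candidate paths have to be kept around, because the chosen path is what gets opened as the basis.

```python
import posixpath


def dir_parts(path):
    """Directory components of a slash-separated relative path."""
    head = posixpath.dirname(path)
    return head.split("/") if head else []


def shared_dir_depth(a, b):
    """Count the leading directory components that two paths have in common."""
    depth = 0
    for x, y in zip(dir_parts(a), dir_parts(b)):
        if x != y:
            break
        depth += 1
    return depth


def best_basis(new_path, candidates):
    """Pick one candidate path; all candidates are assumed to match on size and mtime."""
    if not candidates:
        return None

    def score(candidate):
        # Hypothetical heuristic: same basename first, then closest directory.
        same_name = posixpath.basename(candidate) == posixpath.basename(new_path)
        return (same_name, shared_dir_depth(candidate, new_path))

    return max(candidates, key=score)


choice = best_basis(
    "photos/2007/img1.jpg",
    ["backup/img9.jpg", "photos/img1.jpg", "photos/2007/old.jpg"],
)
```

Here the candidate with the same basename wins over the one in the same directory. Whether that ordering is the right one is a design choice that the thread leaves open.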