The rsync 2.5.6 TODO file mentions the need for hard link test cases.
Here is one in which a linked file is unnecessarily transferred in full.

# Setup initial directories
mkdir src dest
dd if=/dev/zero bs=1024 count=10000 of=src/a 2>/dev/null
rsync -a src/. dest/.
ln src/a src/b
# At this point, a & b exist in src; only a exists in dest.
rsync -aHv src/. dest/.
building file list ... done
./
b => a
wrote 78 bytes read 20 bytes 196.00 bytes/sec
total size is 20480000 speedup is 208979.59

The above is GOOD behavior; only the file metadata was transferred, and
the link was made in dest, as expected.

Now try the failure case:

# Setup initial directories
rm dest/a
# At this point, a & b exist in src; only b exists in dest.
rsync -aHv src/. dest/.
building file list ... done
./
a
b => a
wrote 10241366 bytes read 36 bytes 6827601.33 bytes/sec
total size is 20480000 speedup is 2.00

The above is BAD (nonoptimal) behavior; the entire file is transferred,
even though it could simply have been linked. It seems that "a" is
transferred before it is determined that a suitable equivalent (linked)
file "b" already exists.

I suspect that this has to do with handling the file list in a sorted
order; when the missing filename is encountered first, it is transferred
in full. Not being familiar with the rsync protocol or source code, I
can't say whether this should be fixed on the client or server side.

--Pete
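Since the TODO file asks for test cases, the failure case above could be made self-checking roughly as follows. This is only a sketch: the function names (inode_of, setup_case, run_case) are made up for illustration, cp -p stands in for the first rsync run so the fixture itself does not need rsync, and the file is kept small because the check looks at inode numbers rather than byte counts. The idea is that if "a" is fetched and "b" is re-linked to it, dest/b ends up with a new inode; if the existing "b" is reused, its inode is unchanged.

```sh
#!/bin/sh
# Self-checking sketch of the failure case. All function names here are
# illustrative; none of them are part of rsync.

# Print the inode number of a file.
inode_of() {
    ls -di "$1" | awk '{ print $1 }'
}

# Build the state just before the failing run inside directory $1:
# a and b are hard-linked in src; only b exists in dest.
# Assumption: cp -p replaces the initial "rsync -a" of the original recipe.
setup_case() {
    mkdir "$1/src" "$1/dest" &&
    dd if=/dev/zero bs=1024 count=16 of="$1/src/a" 2>/dev/null &&
    ln "$1/src/a" "$1/src/b" &&
    cp -p "$1/src/a" "$1/dest/b"
}

# Run rsync on the fixture in directory $1 (requires rsync on PATH).
# Returns 0 only if dest/a and dest/b are hard-linked AND dest/b kept its
# original inode, i.e. the existing copy was reused instead of replaced.
# Returns 2 if rsync itself failed, 1 if either check failed.
run_case() {
    before=$(inode_of "$1/dest/b")
    rsync -aH "$1/src/." "$1/dest/." || return 2
    [ "$(inode_of "$1/dest/a")" = "$(inode_of "$1/dest/b")" ] || return 1
    [ "$(inode_of "$1/dest/b")" = "$before" ]
}

# Possible use:
#   tmp=$(mktemp -d) && setup_case "$tmp" && run_case "$tmp"
#   echo "result: $?"
```

A result of 1 would correspond to the BAD case reported above; what any particular rsync version returns is not claimed here.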
On Tue, Nov 25, 2003 at 03:30:53PM -0800, Pete Wenzel wrote:
[ ... ]
> The above is BAD (nonoptimal) behavior; the entire file is transferred,
> even though it could simply have been linked. It seems that "a" is
> transferred before it is determined that a suitable equivalent (linked)
> file "b" already exists.
>
> I suspect that this has to do with handling the file list in a sorted
> order; when the missing filename is encountered first, it is transferred
> in full. Not being familiar with the rsync protocol or source code, I
> can't say whether this should be fixed on the client or server side.

Actually, this is because hardlinks are detected as each file is
considered for transfer. To be transfer-optimal we would have to
create the hardlink table in a separate loop, after the flist sort but
before anything else, and add a status field recording whether any of
the links had been transferred. The logic for dealing with the
hardlinks would then have to be made much more complex.

I'm not sure that would be worth the cost in terms of delay.

-- 
________________________________________________________________
	J.W. Schultz            Pegasystems Technologies
	email address:		jw@pegasys.ws

		Remember Cernan and Schmitt
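To make the idea of a separate pre-pass with a status field more concrete, here is a rough shell approximation that runs outside rsync. It is not rsync code: hlink_table is a made-up name, the device number is left out on the assumption that each tree sits on a single filesystem, file names are assumed to contain no whitespace, and only files with a link count above one are listed. For every group of hard-linked names in the source it reports the name that sorts first, plus a status column naming a group member that the destination already has ("-" if none).

```sh
#!/bin/sh
# Sketch of a hard-link table built in one pass before any transfer.
# hlink_table is an illustrative name, not an rsync function.
# Assumptions: one filesystem per tree, no whitespace in file names.

# Usage: hlink_table SRC DEST
# Output, one line per hard-link group in SRC:
#   <first name in sort order> <member already present in DEST, or "->"
hlink_table() {
    ( cd "$1" && find . -type f -links +1 -exec ls -di {} + ) |
        sed -e 's|^ *||' -e 's| \./| |' |
        LC_ALL=C sort -k1,1n -k2 |
        {
            cur='' first='' have=-
            while read -r inode name; do
                if [ "$inode" != "$cur" ]; then
                    # New inode: emit the finished group, start the next.
                    if [ -n "$cur" ]; then echo "$first $have"; fi
                    cur=$inode first=$name have=-
                fi
                # Status field: first group member found in DEST.
                if [ "$have" = - ] && [ -e "$2/$name" ]; then
                    have=$name
                fi
            done
            # Emit the last group, if there was any input at all.
            if [ -n "$cur" ]; then echo "$first $have"; fi
        }
}
```

For the failure case in this thread (a and b linked in src, only b in dest), this would print "a b": the first name is missing at the destination, but another member of its group is already there.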
John Van Essen
2003-Nov-27 21:22 UTC
How hard links are processed (was Test case for hard link failure)
On Tue, 25 Nov 2003, Pete Wenzel <pmwenzel@yahoo.com> wrote:
[ ... ]
> The above is BAD (nonoptimal) behavior; the entire file is transferred,
> even though it could simply have been linked. It seems that "a" is
> transferred before it is determined that a suitable equivalent (linked)
> file "b" already exists.
>
> I suspect that this has to do with handling the file list in a sorted
> order; when the missing filename is encountered first, it is transferred
> in full. Not being familiar with the rsync protocol or source code, I
> can't say whether this should be fixed on the client or server side.

I ran into this exact same situation a while back, and took a good long
look at the code in hlink.c to figure out what the problem was.

Here is my take on an explanation of how rsync processes hardlinks.
This could be added as a section to the (newly-restored) rsync doc file
at http://www.pegasys.ws/how-rsync-works.html

Additional comments follow...

--------------------- start of documentation -----------------------

When a -H or --hard-links option is used to preserve hard links on the
destination (receiver), the following steps to construct the hardlink
list (named "hlink_list") happen after the file list is received:

 - An array to hold pointers to all the file list entries is allocated.
 - It is initialized to each file list entry in list-order.
 - It is sorted (using qsort) with a 3-part comparison test:
     1 - device number
     2 - inode number
     3 - file name (full path)

This list of pointers now has hardlinked files grouped together, and
within each group the files are sorted by name.

As rsync processes each file in the main loop, each file's hardlink
status is determined using the hardlink list:

 - A binary search for the file is done using the 3-part comparison
   test.
 - The immediately preceding entry in the hardlink list is examined:
     - If it has the same dev/inode as the current entry, then the
       current file of interest is to be hardlinked to at least one
       other file and nothing is done at this time (a message is
       printed in verbose mode).
     - Otherwise, it is either a non-hardlinked file, or it is the
       first file in a hardlink group. Either way it is treated as a
       normal file and if it is not present, it is fetched from the
       sender.

After rsync has processed all the files, an additional post-processing
step is performed. For each group of hardlinked files, the first file
is left alone, and for each of the remaining files, any existing file
is unlinked, and a hardlink is created to its immediate predecessor.

--------------------- end of documentation -------------------------

This explains the situation where if A and B are hardlinked, and B is
present but A is not, A will get transferred, B will get *deleted*, and
B is then hardlinked to A. That's exactly what I see happening. If A
is present, and B is not, then no transfer is needed and only the
hardlink is done.

There is no easy solution. You can't re-sort the hardlink groups to put
an existing file at the top because that messes up the binary search.

The only way I thought of to solve this is to pre-process the hardlink
list and for each group of same dev/inode files, if the first file does
not exist, examine the remaining entries to see if any of them exist.
If one is found, then hardlink the first file to the existing file.
Now, if the first file matches its source file when rsync processes it,
then it won't get needlessly fetched.

I decided not to tackle this, since it requires more in-depth knowledge
of rsync to utilize its various support routines, and I just don't feel
that comfortable with it. So my rsyncs take an extra 20 minutes each
night needlessly transferring hundreds of megabytes of log files. :-/

If anyone else cares to take on this project, be my guest!
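The grouping rule and the pre-processing idea described above can be tried out from the shell without touching rsync. The sketch below is an approximation under stated assumptions, not a rendering of hlink.c: all three function names are made up, the device number is dropped (one filesystem per tree assumed), file names must not contain whitespace, and only files with more than one link are listed, whereas the hardlink list described above holds every file list entry. hlink_plan prints the first file of each group on its own and every later file as "name => predecessor"; prelink_first performs the proposed fix-up on the destination before rsync is run.

```sh
#!/bin/sh
# Illustrative helpers; none of these names exist in rsync.
# Assumptions: one filesystem per tree, no whitespace in file names.

# Print "inode name" for each multiply-linked file under $1, sorted by
# inode number and then by name (byte order), mimicking the sort above.
hlink_sorted() {
    ( cd "$1" && find . -type f -links +1 -exec ls -di {} + ) |
        sed -e 's|^ *||' -e 's| \./| |' |
        LC_ALL=C sort -k1,1n -k2
}

# Usage: hlink_plan DIR
# A file whose predecessor shares its inode is shown as a link to that
# predecessor; otherwise it is the first of its group and shown alone.
hlink_plan() {
    hlink_sorted "$1" | awk '{
        ino = $1 ""                       # compare inodes as strings
        if (ino == prev_ino) print $2 " => " prev_name
        else print $2
        prev_ino = ino; prev_name = $2
    }'
}

# Usage: prelink_first SRC DEST
# For each group in SRC whose first file is missing from DEST, link it
# to the first other group member that DEST already has.
prelink_first() {
    hlink_sorted "$1" | {
        cur='' first='' need=no
        while read -r inode name; do
            if [ "$inode" != "$cur" ]; then
                cur=$inode first=$name
                if [ -e "$2/$first" ]; then need=no; else need=yes; fi
            elif [ "$need" = yes ] && [ -f "$2/$name" ]; then
                ln "$2/$name" "$2/$first" && need=no
            fi
        done
    }
}
```

In the failure case from this thread, running prelink_first src dest before rsync -aH src/. dest/. would create dest/a as a link to the existing dest/b, so "a" should then match its source when rsync considers it. Whether the link is kept still depends on rsync finding the file up to date.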
-- John Van Essen Univ of MN Alumnus <vanes002@umn.edu>