Andrew Gideon
2023-Jul-08 19:19 UTC
Is this the best way to combine --times and --link-dest along with a de-duplicator?
I'm copying files using --link-dest to avoid duplication. I'm also using a de-duplicator (rmlint) to further reduce duplication. For files that are duplicates, I have rmlint set to use the timestamp of the oldest file.

This ends up with starting conditions where the source of a copy might have been:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/SRC/{a,b}
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    [root@archive3 tmp]#

while the previously copied (and de-duplicated) result is:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/DEST0/{a,b}
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    [root@archive3 tmp]#

Note that files 'a' and 'b' in the result directory DEST0 share an inode (are hardlinked), while they are separate inodes (with identical content but different timestamps) in the source directory SRC.

If I create a new copy using --link-dest as follows, I get my desired behavior:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/SRC/c: No such file or directory
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    [root@archive3 tmp]# rsync --itemize-changes -rlpgoD --size-only --link-dest=/tmp/DEST0 /tmp/SRC/ /tmp/DEST1
    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/SRC/c: No such file or directory
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    [root@archive3 tmp]#

Note that DEST1/a and DEST1/b share an inode because of --link-dest and the fact that the source 'a' matches DEST0/a while the source 'b' matches DEST0/b (comparing with --size-only).

What I don't like about this set of rsync options is what happens with 'c' in this case:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    75691117 -rw-r--r--. 1 0 0 4 2023-07-08 10:00:00.000000000 -0400 /tmp/SRC/c
    [root@archive3 tmp]# rsync --itemize-changes -rlpgoD --size-only --link-dest=/tmp/DEST0 /tmp/SRC/ /tmp/DEST1
    >f+++++++++ c
    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/b
    75693529 -rw-r--r--. 1 0 0 4 2023-07-08 14:50:08.670091884 -0400 /tmp/DEST1/c
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    75691117 -rw-r--r--. 1 0 0 4 2023-07-08 10:00:00.000000000 -0400 /tmp/SRC/c
    [root@archive3 tmp]#

Note that the timestamp of 'c' was not preserved in the copy. In the case of 'a' and 'b' I didn't care which of the two timestamps was used, but I do want the timestamp taken from one of the source files. The copy of 'c' breaks this, as the timestamp of DEST1/c is the time of the copy, not that of SRC/c.

The solution should be obvious: add --times (or replace -rlpgoD with -a). However, this breaks the --link-dest behavior:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/SRC/c: No such file or directory
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    [root@archive3 tmp]# rsync --itemize-changes -rlpgoD --size-only --times --link-dest=/tmp/DEST0 /tmp/SRC/ /tmp/DEST1
    .d..t...... ./
    cf..t...... b
    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/SRC/c: No such file or directory
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 3 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 3 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676810 -rw-r--r--. 3 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/a
    75689925 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/DEST1/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    [root@archive3 tmp]#

Because rsync is preserving the timestamp of SRC/b in DEST1/b, DEST1/b no longer shares the inode of DEST1/a and DEST0/{a,b}. That's reasonable for these options, but not what I want.

I want to use one of the source timestamps. I may not care which, but it should be one of them. I don't want the timestamp on a copied file to be the time of the copy.

I've come up with a solution which works but which feels like cheating or abusing --modify-window:

    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    ls: cannot access /tmp/DEST*/c: No such file or directory
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 2 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    75690561 -rw-r--r--. 1 0 0 0 2023-07-08 10:00:00.000000000 -0400 /tmp/SRC/c
    [root@archive3 tmp]# rsync --itemize-changes -rlpgoD --times --modify-window=99999 --size-only --link-dest=/tmp/DEST0 /tmp/SRC/ /tmp/DEST1
    >f+++++++++ c
    [root@archive3 tmp]# ls -lin --time-style=full-iso /tmp/{SRC,DEST*}/{a,b,c}
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST0/b
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/a
    75676810 -rw-r--r--. 4 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/DEST1/b
    75689931 -rw-r--r--. 1 0 0 0 2023-07-08 10:00:00.000000000 -0400 /tmp/DEST1/c
    75676805 -rw-r--r--. 1 0 0 4 2023-07-08 12:36:42.955935831 -0400 /tmp/SRC/a
    75687257 -rw-r--r--. 1 0 0 4 2023-07-08 14:18:45.497699620 -0400 /tmp/SRC/b
    75690561 -rw-r--r--. 1 0 0 0 2023-07-08 10:00:00.000000000 -0400 /tmp/SRC/c
    [root@archive3 tmp]#

In this case, all is well. DEST*/{a,b} share an inode and DEST1/c has the timestamp from SRC/c.

The use of --times caused the timestamp of DEST1/c to be correct, while --modify-window=99999 appears to have let DEST1/b share an inode with DEST0/{a,b} and DEST1/a, since the timestamp of SRC/b is within 99999 seconds of the timestamp on DEST0/b.

It works, and it makes sense. Is this really the proper way to do this? It feels like cheating because I'm using --modify-window to force a particular result (copy vs. link of 'b') as opposed to choosing matches. I understand that "choosing matches" is in this case what determines the result (copy vs. link), but ... it still feels like a misuse.

Is there a better approach? Am I nuts?

Thanks.
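
For anyone who wants to try this without touching /tmp/SRC directly, here is a minimal sketch that rebuilds a similar layout in a scratch directory and runs the --modify-window variant. The WORK variable, the file contents ('foo', 'bar'), and the whole-second timestamps are assumptions made for illustration, not the exact files from the transcripts above; the fallback on ls is there only because --time-style is a GNU option.

```shell
#!/bin/sh
# Hypothetical reproduction of the SRC/DEST0 layout described above.
# WORK is an assumed scratch directory, not a path from the original post.
set -eu
WORK=$(mktemp -d)
mkdir -p "$WORK/SRC" "$WORK/DEST0"

# Source: 'a' and 'b' have identical content but different mtimes; 'c' is new.
printf 'foo\n' > "$WORK/SRC/a"
printf 'foo\n' > "$WORK/SRC/b"
printf 'bar\n' > "$WORK/SRC/c"
touch -t 202307081236.42 "$WORK/SRC/a"
touch -t 202307081418.45 "$WORK/SRC/b"
touch -t 202307081000.00 "$WORK/SRC/c"

# Previous backup after de-duplication: 'a' and 'b' are hardlinked and
# carry the older of the two timestamps (the one from SRC/a).
cp -p "$WORK/SRC/a" "$WORK/DEST0/a"
ln "$WORK/DEST0/a" "$WORK/DEST0/b"

# Same options as the last transcript: --times plus a wide --modify-window.
if command -v rsync >/dev/null 2>&1; then
    rsync --itemize-changes -rlpgoD --times --modify-window=99999 --size-only \
        --link-dest="$WORK/DEST0" "$WORK/SRC/" "$WORK/DEST1"
    # Show inode numbers and timestamps; fall back if ls is not GNU ls.
    ls -lin --time-style=full-iso "$WORK/SRC" "$WORK/DEST0" "$WORK/DEST1" \
        2>/dev/null || ls -lin "$WORK/SRC" "$WORK/DEST0" "$WORK/DEST1"
fi
```

Dropping --modify-window=99999 from the rsync line in this sketch should give the --times-only result, where DEST1/b becomes a separate inode.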