Bernard A Badger
2002-May-15 14:00 UTC
Does any rsync-based diff, rmdup, cvs software exist?
I'd like to be able to run GNU-diff type comparisons, but use R-sync technology to make it efficient to see what's different without transmitting all that data.

Another thing I like to do using rsync protocol, is what I call rmdup -- remove duplicates. This would allow me to recursively (like diff -r) compare files in two (!!MUST BE!!) different directories and remove one (or the other) of the duplicates.

Again, the rsync protocol could be useful in configuration management, for computing the "deltas" that must be stored.

Have these ideas been implemented?
> I'd like to be able to run GNU-diff type comparisons,
> but use R-sync technology to make it efficient to see what's
> different without transmitting all that data.

Rsync is great at synchronizing data between a source and destination. For diff-like comparisons, perhaps something like CVS might be more appropriate.

> Another thing I like to do using rsync protocol,
> is what I call rmdup -- remove duplicates.
> This would allow me to recursively (like diff -r) compare files in
> two (!!MUST BE!!) different directories and remove one (or the other)
> of the duplicates.

A shell script that does something similar to what you want without using rsync:

    #!/bin/sh
    # SOURCE_DIR and DESTINATION_DIR must be set to absolute paths.
    # Assumes filenames contain no whitespace.

    # Our md5 checksum program (rsync uses md4, but the concept is the same)
    MD5=md5sum        # On RedHat 7.1
    #MD5=md5          # On *BSD (output format differs; adjust the awk step)

    # Inventory the source directory as "filename checksum" lines
    cd "$SOURCE_DIR" || exit 1
    src=/var/tmp/find.$$.src
    find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > "$src"

    # Inventory the destination directory
    cd "$DESTINATION_DIR" || exit 1
    dst=/var/tmp/find.$$.dst
    find . -xdev -type f -print | xargs $MD5 | awk '{print $2, $1}' | sort > "$dst"

    # Remove duplicates in the destination directory (still the current
    # directory). rm -i reads its answer from /dev/tty because standard
    # input is the pipe carrying the filenames.
    comm -12 "$src" "$dst" | sed -e 's/ .*//' |
    while read -r f; do rm -i "$f" < /dev/tty; done

    rm "$src" "$dst"

Note: "comm -12" does a line-by-line comparison of the two checksum lists. The output is the lines common to both files. If a filename/checksum pair matches in both the source and destination directories, the file in the destination directory is the "duplicate" (per the definition in the e-mail) and is passed to "rm -i" for removal.

Note: Configuring for use with a source or destination directory on a remote host would include the strategic use of rsh or ssh.
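A rough sketch of how the remote case in that last note might look, assuming ssh access and md5sum on both ends. The function names (local_inventory, remote_inventory, common_files) and the tab-separated "path, checksum" line format are made up for this illustration; paths containing tabs, newlines, or single quotes are not handled.

```shell
# Command string that prints "path<TAB>md5" for every regular file below the
# current directory, sorted bytewise. It is kept as a string so the same text
# can be run locally (sh -c) or handed to a remote shell (ssh).
INVENTORY_CMD='find . -type f -exec md5sum {} + |
  awk '\''{ sum = $1; sub(/^[^ ]+ +[*]?/, ""); print $0 "\t" sum }'\'' |
  LC_ALL=C sort'

local_inventory() {
    # $1 = directory; the subshell leaves the caller's working directory alone
    ( cd "$1" && sh -c "$INVENTORY_CMD" )
}

remote_inventory() {
    # $1 = host, $2 = directory on that host (both supplied by the caller)
    ssh "$1" "cd '$2' && $INVENTORY_CMD"
}

common_files() {
    # $1, $2 = inventory files; prints paths whose name AND checksum both match
    LC_ALL=C comm -12 "$1" "$2" | cut -f1
}
```

For example, `remote_inventory backuphost /data > dst.inv` (a hypothetical host and path) followed by `common_files src.inv dst.inv` would list the candidate duplicates. Only the checksum list crosses the network, not the file contents.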
The good news is that because only a list of checksums is needed for comparison, the bandwidth needed between servers is minimized (like rsync).

> Again, the rsync protocol could be useful in configuration management,
> for computing the "deltas" that must be stored.

CVS (or even RCS) is more useful for configuration management and updates of text files. It also archives changes over time. As far as I'm aware (without looking at source code), rsync does block-level comparisons, not line-by-line.

-- Eric Ziegast
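To make the block-level versus line-by-line distinction concrete, here is a minimal sketch that checksums a file in fixed-size blocks and reports which block numbers differ between two versions. It is a simplification and not rsync's actual algorithm: rsync also uses a rolling checksum so that it can find matching blocks at any byte offset, which this fixed-offset version cannot do. The function names and the use of md5sum are assumptions for the example.

```shell
block_sums() {
    # $1 = file, $2 = block size in bytes
    # Prints one "index checksum" line per block, indexes starting at 0.
    size=$(wc -c < "$1" | tr -d ' ')
    count=$(( (size + $2 - 1) / $2 ))
    i=0
    while [ "$i" -lt "$count" ]; do
        sum=$(dd if="$1" bs="$2" skip="$i" count=1 2>/dev/null |
              md5sum | cut -d' ' -f1)
        echo "$i $sum"
        i=$((i + 1))
    done
}

changed_blocks() {
    # $1 = block sums of the old file, $2 = block sums of the new file
    # Prints the index of every block in the new file that is new or differs.
    # The "" concatenation forces a string comparison of the checksums.
    awk 'FILENAME == ARGV[1] { old[$1] = $2; next }
         (old[$1] "") != ($2 "") { print $1 }' "$1" "$2"
}
```

With a 4-byte block size, changing the middle of `AAAABBBBCCCC` to `AAAAXXXXCCCC` would flag only block 1, whereas a line-oriented diff would report the whole line as changed. Inserting a single byte at the front would shift every block and flag them all, which is the case the rolling checksum in rsync is designed to handle.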