Hi Harald,
On Mon, Dec 12, 2016 at 02:31:03PM +0100, Harald Dunkel wrote:
> Have you considered to introduce a "deduplicate mode" for
> rsync, replacing duplicate files in the destination directory
> by hard links?
For a month now I have been successfully using the offline
deduplication feature that is currently experimental in XFS to
reduce the size of my rsnapshot backups. Some more info:
http://strugglers.net/~andy/blog/2017/01/10/xfs-reflinks-and-deduplication/
rsync already hardlinks together files that do not change between
two backup runs, but reflinks also let me deduplicate files that
cycle between known contents, as well as partially-identical files
and identical regions across multiple different directories (and so
across different hosts, for example).
At the moment this saves me about 27% of the backup volume, though
that is of course entirely dependent on what you are backing up.
Also do note that examining the whole tree of files is really hard
on the storage, as it hits it with a large amount of random IO.
Especially with slow rotational storage it may well be cheaper just
to buy more capacity. Personally I am using SSDs, so the performance
vs. capacity trade-off is different for me.
Not speaking for the rsync developers, but deduplicating all files
within a directory would require rsync to read every file in that
directory, which is something it would not normally do unless they
are the target of a transfer. Since other utilities already exist
for examining files and hardlinking dupes together, or indeed for
doing it inside the filesystem at the block/extent level, the
feature may not belong inside rsync.
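To illustrate what those standalone utilities (rdfind, hardlink(1)
and friends) roughly do: hash every file, then replace duplicates
with hard links. A toy sketch only — real tools compare sizes first
and verify byte-for-byte before linking; the /tmp/dupes paths are
invented for the example:

```shell
# Create two identical files in different directories.
mkdir -p /tmp/dupes/a /tmp/dupes/b
echo "same bytes" > /tmp/dupes/a/x
echo "same bytes" > /tmp/dupes/b/y

cd /tmp/dupes
# Hash all files, sort so identical hashes are adjacent, then
# hardlink each duplicate onto the first file with that hash.
find . -type f -exec md5sum {} + | sort | \
while read -r hash path; do
    if [ "$hash" = "$prev_hash" ]; then
        ln -f "$prev_path" "$path"   # replace dupe with a hard link
    else
        prev_hash=$hash
        prev_path=$path
    fi
done
```

Note this is exactly the whole-tree random-IO workload mentioned
above, which is why it hurts on rotational storage.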
Cheers,
Andy