I apologize if this has already been discussed before, but as of yet
I have been unable to find any info on the topic.

I have a very simple (and common) disk-based backup system using
rsync, hard links, and a little bit of Perl to glue it together. Remote
machines are backed up regularly, using hard links across each snapshot
to reduce disk usage.

Recently I learned that rsync does a checksum of every file
transferred. I thought it might be interesting to record the path and
checksum of each file in a table. On future backups, the checksum of a
file being backed up could be looked up in the table. If there's a
matching checksum, a hard link would be created to the match instead of
storing a new copy. This means that hard links wouldn't be limited to
just the immediately preceding snapshot (as is the case with my current
setup); instead, a hard link could be created to an identical file
located in a different machine's snapshot.

My initial concern was that doing the checksums would be too
CPU-expensive, but if rsync is already doing them then that isn't an
issue. My next thought was that the checksums would be susceptible to
collisions, leading to potential data loss by linking to a non-identical
file. However, from what I've read on Wikipedia, rsync uses both MD5 and
a rolling checksum, which together make a collision /very/ unlikely, so
accidentally linking to a non-identical file shouldn't be a real risk.

Is this approach even possible, or am I missing something? I know my
labs have a lot of duplicate data across many machines, so this could
save me hundreds of GiBs, maybe even a TiB or two.

If this is possible, how can I save the resulting checksum of a file
from rsync?

Thank you for your time. I look forward to hearing your thoughts.

---Alex
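P.S. To make the idea concrete, here is a very rough Perl sketch of the
kind of glue I have in mind. It is only a sketch: the snapshot path is
made up, it keeps the checksum table in memory rather than on disk, and
it hashes the files itself instead of reusing rsync's checksums.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    # Hypothetical sketch: walk one freshly made snapshot, hash every
    # regular file, and hard link later duplicates to the first copy seen.
    my $snapshot = '/backups/hostA/2011-11-03';   # placeholder path
    my %seen;    # md5 hex digest -> path of first copy with that content

    find(sub {
        return unless -f $_ && !-l $_;            # regular files only
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        if (my $orig = $seen{$md5}) {
            unlink $_;                            # drop the duplicate...
            link $orig, $_                        # ...and link it to the first copy
                or warn "link $orig -> $File::Find::name failed: $!";
        } else {
            $seen{$md5} = $File::Find::name;
        }
    }, $snapshot);

A real version would persist the table across runs and across machines,
and would have to consider files whose contents match but whose
ownership or permissions differ, since a hard link cannot keep those
separate.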
On 2011-11-03, Alex Waite <alexqw85 at gmail.com> wrote:
> If there's a matching checksum, a hard link will be created to the
> match instead of storing a new copy. This means that the use of hard
> link won't be limited to just the immediately preceding snapshot (as
> is the case with my current setup). Instead a hard link could be
> created to an identical file located in a different machine's snapshot.
> [...]
> If this is possible, how can I save the resulting checksum of a
> file from rsync?

Not a direct answer, but this may do what you want:

http://gitweb.samba.org/?p=rsync-patches.git;a=blob;f=link-by-hash.diff

    This patch adds the --link-by-hash=DIR option, which hard links
    received files in a link farm arranged by MD4 file hash. The
    result is that the system will only store one copy of the unique
    contents of each file, regardless of the file's name.

Cheers,
Chris
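P.S. If you build rsync with that patch applied, a hypothetical
invocation (paths made up) might look like:

    rsync -aH --link-by-hash=/backups/.hashfarm user@host:/home/ /backups/host/2011-11-03/

with the link farm directory on the same filesystem as the backups,
since hard links cannot cross filesystems.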
On 03/11/2011 01:09, Alex Waite wrote:
> Is this approach even possible, or am I missing something? I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.

Check out http://backuppc.sourceforge.net/. It's a Perl-based backup
tool that uses rsync and does exactly what you're asking for.
Alex Waite (alexqw85 at gmail.com) wrote on 2 November 2011 20:09:
> Recently I learned that rsync does a checksum of every file
> transferred. I thought it might be interesting to record the path and
> checksum of each file in a table. On future backups, the checksum of
> a file being backed up could be looked up in the table. If there's a
> matching checksum, a hard link will be created to the match instead of
> storing a new copy. This means that the use of hard link won't be
> limited to just the immediately preceding snapshot (as is the case
> with my current setup). Instead a hard link could be created to an
> identical file located in a different machine's snapshot.
...
> Is this approach even possible, or am I missing something? I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.

It is, but the management of it all is up to you; it's not rsync's job.

> If this is possible, how can I save the resulting checksum of a
> file from rsync?

You'll have to use at least rsync 3 on the source machines, and on the
backup machine you need 3.1. Configure --out-format with %C to get the
MD5 into the log.

Note that rsync only puts the MD5 in the log when it actually pulls the
file (or when you use -c); if it creates a hard link itself, the MD5 is
not computed, so it does not appear in the log.
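To sketch how that log could drive the cross-machine hard linking
(untested; the snapshot path and index file are made up, and it assumes
each pull uses something like --out-format='%C %n' with the resulting
"<md5> <name>" lines fed to the script on stdin):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $snapshot = '/backups/hostA/2011-11-03';   # where this run was received
    my $index    = '/backups/.md5-index';         # one "checksum<TAB>path" per line

    # Load the existing checksum -> path table.
    my %table;
    if (open my $in, '<', $index) {
        while (my $row = <$in>) {
            chomp $row;
            my ($sum, $path) = split /\t/, $row, 2;
            $table{$sum} = $path if defined $path;
        }
        close $in;
    }

    open my $out, '>>', $index or die "cannot append to $index: $!";
    while (my $line = <STDIN>) {
        chomp $line;
        my ($sum, $name) = split / /, $line, 2;
        # The checksum is only present for files rsync actually transferred.
        next unless defined $name && defined $sum && $sum =~ /^[0-9a-f]{32}$/;
        my $file = "$snapshot/$name";
        next unless -f $file && !-l $file;

        if (my $orig = $table{$sum}) {
            next if $orig eq $file;
            unlink $file;                         # replace the duplicate...
            link $orig, $file                     # ...with a link to the old copy
                or warn "link $orig -> $file failed: $!";
        } else {
            $table{$sum} = $file;
            print {$out} "$sum\t$file\n";
        }
    }
    close $out;

It skips any line whose first field isn't a 32-character hex digest,
which also covers the case above where no checksum was computed.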
2011/11/3 Alex Waite <alexqw85 at gmail.com>:
> Recently I learned that rsync does a checksum of every file
> transferred. I thought it might be interesting to record the path and
> checksum of each file in a table. On future backups, the checksum of

I guess you may be interested in these projects:

- lessfs: deduplication and compression via FUSE (it's a mature project);
- Bup: it uses the git format to store backups (young but very powerful).

Ciao,
Andrea

----------------
http://www.lessfs.com
https://github.com/apenwarr/bup