I apologize if this has already been discussed before, but as of yet
I have been unable to find any info on the topic.

I have a very simple (and common) disk-based backup system using
rsync, hard links, and a little bit of Perl to glue it together. Remote
machines are backed up regularly, using hard links across each snapshot
to reduce disk usage.

Recently I learned that rsync does a checksum of every file
transferred. I thought it might be interesting to record the path and
checksum of each file in a table. On future backups, the checksum of a
file being backed up could be looked up in the table. If there's a
matching checksum, a hard link would be created to the match instead of
storing a new copy. This means that hard links wouldn't be limited to
just the immediately preceding snapshot (as is the case with my current
setup); instead, a hard link could be created to an identical file
located in a different machine's snapshot.

My initial concern was that doing the checksums would be too
CPU-expensive, but if rsync is already doing them then that isn't an
issue. My next thought was that the checksums would be susceptible to
collisions, leading to potential data loss by linking to a non-identical
file. However, from what I've read on Wikipedia, rsync uses both MD5 and
a rolling checksum, which together make a collision /very/ unlikely, so
accidentally linking to a non-identical file shouldn't be a real risk.

Is this approach even possible, or am I missing something? I know my
labs have a lot of duplicate data across many machines, so this could
save me hundreds of GiBs, maybe even a TiB or two.

If this is possible, how can I save the resulting checksum of a file
from rsync?

Thank you for your time. I look forward to hearing your thoughts.

---Alex
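P.S. To make the idea concrete, here is a very rough Perl sketch of the
kind of glue I have in mind. It is only a sketch: the snapshot path is
made up, it keeps the checksum table in memory rather than on disk, and
it hashes the files itself instead of reusing rsync's checksums.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find;
    use Digest::MD5;

    # Hypothetical sketch: walk one freshly made snapshot, hash every
    # regular file, and hard link later duplicates to the first copy seen.
    my $snapshot = '/backups/hostA/2011-11-03';   # placeholder path
    my %seen;    # md5 hex digest -> path of first copy with that content

    find(sub {
        return unless -f $_ && !-l $_;            # regular files only
        open my $fh, '<', $_ or return;
        binmode $fh;
        my $md5 = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;

        if (my $orig = $seen{$md5}) {
            unlink $_;                            # drop the duplicate...
            link $orig, $_                        # ...and link it to the first copy
                or warn "link $orig -> $File::Find::name failed: $!";
        } else {
            $seen{$md5} = $File::Find::name;
        }
    }, $snapshot);

A real version would persist the table across runs and across machines,
and would have to consider files whose contents match but whose
ownership or permissions differ, since a hard link cannot keep those
separate.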
On 2011-11-03, Alex Waite <alexqw85 at gmail.com> wrote:
> If there's a matching checksum, a hard link will be created to the
> match instead of storing a new copy. This means that the use of hard
> link won't be limited to just the immediately preceding snapshot (as
> is the case with my current setup). Instead a hard link could be
> created to an identical file located in a different machine's snapshot.
> [...]
> If this is possible, how can I save the resulting checksum of a
> file from rsync?

Not a direct answer, but this may do what you want:

http://gitweb.samba.org/?p=rsync-patches.git;a=blob;f=link-by-hash.diff

    This patch adds the --link-by-hash=DIR option, which hard links
    received files in a link farm arranged by MD4 file hash. The
    result is that the system will only store one copy of the unique
    contents of each file, regardless of the file's name.

Cheers,
Chris
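P.S. If you build rsync with that patch applied, a hypothetical
invocation (paths made up) might look like:

    rsync -aH --link-by-hash=/backups/.hashfarm user@host:/home/ /backups/host/2011-11-03/

with the link farm directory on the same filesystem as the backups,
since hard links cannot cross filesystems.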
On 03/11/2011 01:09, Alex Waite wrote:
> Is this approach even possible, or am I missing something? I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.

Check out http://backuppc.sourceforge.net/. It's a Perl-based backup
tool that uses rsync and does exactly what you're asking for.
Alex Waite (alexqw85 at gmail.com) wrote on 2 November 2011 20:09:
> Recently I learned that rsync does a checksum of every file
> transferred. I thought it might be interesting to record the path and
> checksum of each file in a table. On future backups, the checksum of
> a file being backed up could be looked up in the table. If there's a
> matching checksum, a hard link will be created to the match instead of
> storing a new copy. This means that the use of hard link won't be
> limited to just the immediately preceding snapshot (as is the case
> with my current setup). Instead a hard link could be created to an
> identical file located in a different machine's snapshot.
...
> Is this approach even possible, or am I missing something? I know
> my labs have a lot of duplicate data across many machines, so this
> could save me hundreds of GiBs, maybe even a TiB or two.

It is, but the management of it all is up to you; it's not rsync's job.

> If this is possible, how can I save the resulting checksum of a
> file from rsync?

You'll have to use at least rsync 3 on the source machines, and on the
backup machine you need 3.1. Configure --out-format with %C to get the
MD5 into the log.

Note that rsync only puts the MD5 in the log when it actually pulls the
file (or when you use -c); if it creates a hard link itself, the MD5 is
not computed, so it does not appear in the log.
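To sketch how that log could drive the cross-machine hard linking
(untested; the snapshot path and index file are made up, and it assumes
each pull uses something like --out-format='%C %n' with the resulting
"<md5> <name>" lines fed to the script on stdin):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $snapshot = '/backups/hostA/2011-11-03';   # where this run was received
    my $index    = '/backups/.md5-index';         # one "checksum<TAB>path" per line

    # Load the existing checksum -> path table.
    my %table;
    if (open my $in, '<', $index) {
        while (my $row = <$in>) {
            chomp $row;
            my ($sum, $path) = split /\t/, $row, 2;
            $table{$sum} = $path if defined $path;
        }
        close $in;
    }

    open my $out, '>>', $index or die "cannot append to $index: $!";
    while (my $line = <STDIN>) {
        chomp $line;
        my ($sum, $name) = split / /, $line, 2;
        # The checksum is only present for files rsync actually transferred.
        next unless defined $name && defined $sum && $sum =~ /^[0-9a-f]{32}$/;
        my $file = "$snapshot/$name";
        next unless -f $file && !-l $file;

        if (my $orig = $table{$sum}) {
            next if $orig eq $file;
            unlink $file;                         # replace the duplicate...
            link $orig, $file                     # ...with a link to the old copy
                or warn "link $orig -> $file failed: $!";
        } else {
            $table{$sum} = $file;
            print {$out} "$sum\t$file\n";
        }
    }
    close $out;

It skips any line whose first field isn't a 32-character hex digest,
which also covers the case above where no checksum was computed.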
2011/11/3 Alex Waite <alexqw85 at gmail.com>:
> Recently I learned that rsync does a checksum of every file
> transferred. I thought it might be interesting to record the path and
> checksum of each file in a table. On future backups, the checksum of

I guess you may be interested in these projects:

- lessfs: deduplication and compression via FUSE (it's a mature project);
- Bup: it uses the git format to store backups (young but very powerful).

Ciao,
Andrea

----------------
http://www.lessfs.com
https://github.com/apenwarr/bup