Helge Jensen
2004-Dec-03 11:33 UTC
PROPOSAL: --link-hash-dest, additional linking of files to their HASH values
I'm using the excellent --link-dest option of rsync to do daily full backups of a number of servers. These servers share most of their file-system content, but of course have some variations.

In order to reduce the space used for each backup, I was thinking about linking all files in the backups to their hash in a separate directory, so that each distinct file content is stored only once. This would reduce my backup needs by ~5 GB per machine.

Further, this would allow rsync to represent moved files without any additional space usage; rotated backups and user-moved files, for example, fit this description.

I realize that there may be problems with file-system performance on such large directories, but that could probably be worked around by using some prefix subdirectories.

For cleanup, one could find all the files in the --link-hash-dest directory that have only 1 link.

Any comments?

-- Helge
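To make the proposal concrete, here is a minimal Python sketch of the idea, not rsync code and not an existing option. It assumes SHA-1 as the content hash, two-character prefix subdirectories, and that the pool and the backups live on the same file system (hard links cannot cross file systems). All function names (file_digest, pool_path, link_into_pool, unreferenced) are hypothetical.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-1 of a file's contents (hash choice is an assumption)."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def pool_path(pool_dir, digest, levels=2):
    """Map a digest to pool_dir/ab/cd/abcd... using `levels` prefix subdirs."""
    parts = [digest[2 * i: 2 * i + 2] for i in range(levels)]
    return os.path.join(pool_dir, *parts, digest)


def link_into_pool(path, pool_dir, levels=2):
    """Make `path` share storage with the pool entry for its content."""
    target = pool_path(pool_dir, file_digest(path), levels)
    if os.path.exists(target):
        if os.path.samefile(path, target):
            return target  # already linked to the pool entry
        # Content is already pooled: replace `path` with a link to it.
        tmp = path + ".linktmp"  # hypothetical temporary name
        os.link(target, tmp)
        os.replace(tmp, path)
    else:
        # First time this content is seen: the file becomes the pool entry.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        os.link(path, target)
    return target


def unreferenced(pool_dir):
    """Yield pool entries whose link count is 1, i.e. no backup uses them."""
    for root, _dirs, files in os.walk(pool_dir):
        for name in files:
            full = os.path.join(root, name)
            if os.stat(full).st_nlink == 1:
                yield full
```

Running link_into_pool over every regular file in a finished backup would collapse identical contents into one inode, and deleting whatever unreferenced() yields would be the cleanup step. Note that this sketch keys on content only, so linked files end up sharing one owner, group, and mode.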
Paul Slootman
2004-Dec-03 11:57 UTC
PROPOSAL: --link-hash-dest, additional linking of files to their HASH values
On Fri 03 Dec 2004, Helge Jensen wrote:
>
> In order to reduce the space used for each backup, I was thinking about
> linking all files in the backups to their hash in a separate directory,
> allowing only one storage of each file-value, reducing my backup-needs
> with ~5 GB per machine.
>
> Further this would allow rsync to represent moved files without any
> space-usage at all, for example rotated backups and user-moved files fit
> this description.

Yes, this would be a great addition! Of course, you would also need to add metadata such as owner / group / file modes to the hash...

> I realize, that there may be problems with file-system performance on
> such large directories, but that could probably be worked around by using
> some prefix-subdirs.

Yes, 2-4 levels should be sufficient for most cases; I wonder whether it would be worth making this configurable.

Paul Slootman
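The metadata point follows from how hard links work: all names of an inode share one owner, group, and mode, so two files may only be linked if those attributes match as well. A rough Python sketch of a key that mixes them into the hash, with a configurable number of prefix levels, might look as follows. The names pool_key and prefix_dirs, the SHA-1 choice, and the "uid:gid:mode" encoding are assumptions for illustration only.

```python
import hashlib
import os
import stat


def pool_key(path, chunk_size=1 << 20):
    """Hash owner, group and permission bits together with the contents."""
    st = os.lstat(path)
    h = hashlib.sha1()
    # Hypothetical encoding of the attributes that hard links would share.
    meta = "%d:%d:%o\n" % (st.st_uid, st.st_gid, stat.S_IMODE(st.st_mode))
    h.update(meta.encode("ascii"))
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def prefix_dirs(key, levels=2, width=2):
    """Split the start of a key into `levels` subdirectory names."""
    if levels * width > len(key):
        raise ValueError("key too short for the requested prefix depth")
    return [key[i * width:(i + 1) * width] for i in range(levels)]
```

With this key, identical contents with different permissions land in different pool entries, at the cost of storing that content once per distinct attribute set.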
Wayne Davison
2004-Dec-03 16:26 UTC
PROPOSAL: --link-hash-dest, additional linking of files to their HASH values
On Fri, Dec 03, 2004 at 12:29:34PM +0100, Helge Jensen wrote:
> In order to reduce the space used for each backup, I was thinking about
> linking all files in the backups to their hash in a separate directory,
> allowing only one storage of each file-value, reducing my backup-needs
> with ~5 GB per machine.

One possibility is to use BackupPC, which includes a perl script that can talk the rsync protocol (among other file-acquisition methods) with the machines that you wish to back up. It pools all the identical files together (regardless of each file's attributes, which are kept separate from the file's data) and even compresses the data. We link to the BackupPC project on our resources webpage:

http://rsync.samba.org/resources.html

Another first step toward your goal is the link-by-hash.diff file in the patches directory. This code modifies rsync to link multiple files together in the specified directory based on their hash, but only the newest file's attributes are saved (thus, if you need to restore a file, you may need to set the permissions, ownership, etc. manually). The patch needs some work (see the mailing list for the prior discussion on the subject), so it is not yet ready for use in a production environment.

..wayne..