thr3ads.net - rsync - Thought on large files [Jan 2008]

Brendan Grieve

2008-Jan-23 04:44 UTC

Thought on large files

Hi There,

I've been toying around with the code of rsync on and off for a while, 
and I had a thought that I would like some comments on. It's to do with 
very large files and disk space.

One of the common uses of rsync is to use it as a backup program. A 
client connects to the rsync server, and sends over any changed files. 
If the client has very large files that have changed marginally, then 
rsync efficiently only sends the changed bits.

On the server side, one may have it set up to create 'snapshots' of the 
existing data there by hardlinking that data to another directory 
periodically. There's plenty of documentation on the web about how to do 
this so I won't go into it further.

This is very effective and uses very little disk space, since a file 
that does not change takes up no more disk space (not much more anyway), 
even if it now exists in many snapshots.
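
As a rough illustration of the snapshot scheme described above, here is a minimal Python sketch. The function name `snapshot` and the directory layout are made up for this example; in practice people usually get the same effect with `cp -al` or rsync's `--link-dest` option. It assumes both directories are on the same filesystem, since hard links cannot cross filesystems.

```python
import os


def snapshot(live_dir: str, snap_dir: str) -> None:
    """Mirror live_dir into snap_dir, hard-linking every file instead of copying it.

    Hypothetical helper for illustration only; it ignores ownership,
    permissions on directories, and special files.
    """
    for root, _dirs, files in os.walk(live_dir):
        rel = os.path.relpath(root, live_dir)
        target_root = os.path.normpath(os.path.join(snap_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # A hard link is a second directory entry for the same inode,
            # so the file data is stored only once.
            os.link(os.path.join(root, name), os.path.join(target_root, name))
```

After a snapshot, each file in the live tree and its counterpart in the snapshot share one inode, so the snapshot costs only the directory entries.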

One place where this falls down is if the file is very large. Let's say 
the file, whatever it is, is a 10 GB file, and that some small amount of 
data changes in it. This is efficiently sent across by rsync, BUT the 
rsync server side will correctly break the hard-link and create a new 
file with the changed bits. This means, if even 1 byte of that 10 GB file 
changes, you now have to store that whole file again.

I won't get into the whole issue of why one would have big files etc... 
I see it all the time, especially in the Microsoft world, with Outlook 
PST files, and Microsoft Exchange Database files.

My thought was that if the server could transparently break a large 
file into chunks and store them that way, then one could still make 
use of hard-links efficiently.

For example, going back to a 10 GB Exchange Database file, it's likely not 
going to change too much during use. So if the server stored the huge 
clumsy 'priv1.edb' as:
  .priv1.edb._somemagicstring_.1
  .priv1.edb._somemagicstring_.2
etc...

and intelligently broke the 'hard-links' of only the chunks that actually 
changed, then it would all work well. One could have an option to enable 
this for files bigger than a certain size, and break them into chunks of 
a specified size.
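
To make the chunking idea concrete, here is a minimal Python sketch of what the server-side storage step might look like. It is not rsync code: `store_chunked`, `chunk_name`, and the 64 MiB default are hypothetical, and the magic string is just the placeholder from the example names above. It assumes the previous snapshot lives in its own directory on the same filesystem.

```python
import os

MAGIC = "_somemagicstring_"      # placeholder taken from the example names above
CHUNK_SIZE = 64 * 1024 * 1024    # hypothetical default chunk size (64 MiB)


def chunk_name(name: str, index: int) -> str:
    """Build a chunk file name such as .priv1.edb._somemagicstring_.1"""
    return f".{name}.{MAGIC}.{index}"


def _same(path: str, data: bytes) -> bool:
    """Return True if the file at path holds exactly the bytes in data."""
    if os.path.getsize(path) != len(data):
        return False
    with open(path, "rb") as f:
        return f.read() == data


def store_chunked(src_path, snap_dir, prev_snap_dir=None, chunk_size=CHUNK_SIZE):
    """Store src_path in snap_dir as numbered chunk files.

    Chunks identical to the same-numbered chunk in prev_snap_dir are
    hard-linked rather than rewritten. Returns the number of chunks linked.
    """
    os.makedirs(snap_dir, exist_ok=True)
    name = os.path.basename(src_path)
    linked = 0
    index = 1
    with open(src_path, "rb") as src:
        while True:
            data = src.read(chunk_size)
            if not data:
                break
            dest = os.path.join(snap_dir, chunk_name(name, index))
            prev = None
            if prev_snap_dir is not None:
                prev = os.path.join(prev_snap_dir, chunk_name(name, index))
            if prev and os.path.isfile(prev) and _same(prev, data):
                os.link(prev, dest)   # unchanged chunk: share the old data
                linked += 1
            else:
                with open(dest, "wb") as out:
                    out.write(data)   # changed or new chunk: store a fresh copy
            index += 1
    return linked
```

With fixed-size chunks like this, a change made in place only costs the chunks it touches, but data inserted near the start of the file would shift every later chunk and defeat the sharing.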

One could quite rightly argue that this changes rsync from a tool that 
synchronizes data between places to a dedicated backup tool (as the two 
sides will now have physically different data), however I could see it 
being useful, especially since it wouldn't need changes on the client 
side as the server still presents it as just one file.

What are your comments? Good idea? Stupid idea? Been done before? Does 
anyone have some hints about where in the code I should look to make 
these changes so I can test it out?



Brendan Grieve

Matt McCutchen

2008-Jan-23 06:39 UTC

On Wed, 2008-01-23 at 13:38 +0900, Brendan Grieve wrote:
> Let's say 
> the file, whatever it is, is a 10 GB file, and that some small amount of 
> data changes in it. This is efficiently sent across by rsync, BUT the 
> rsync server side will correctly break the hard-link and create a new 
> file with the changed bits. This means, if even 1 byte of that 10 GB file 
> changes, you now have to store that whole file again.
>
> What my thoughts were is that if the server could transparently break a 
> large file into chunks and store them that way, then one can still make 
> use of hard-links efficiently.
This is a fine idea, but I don't think support for this should be added
to rsync.  Instead, I suggest that you use rdiff-backup
(nongnu.org/rdiff-backup), a backup tool that stores an ordinary latest
snapshot of the source along with reverse deltas for previous snapshots
and redundant attribute information, both in its own format.

Matt

Matt McCutchen

2008-Jan-24 05:06 UTC

On Thu, 2008-01-24 at 13:54 +0900, Brendan Grieve wrote:
> I had a look at rdiff-backup, but I was trying to get something that
> spoke native rsync (IE, not to force any change on the client side).
To achieve this, you can have the client push to an rsync daemon and
then have the daemon call rdiff-backup so that the rdiff-backup part
happens entirely on the server.  The idea is the same as the
daemon-and-rsnapshot setup I described in the following message, but
with rdiff-backup in place of rsnapshot as the backend:

lists.samba.org/archive/rsync/2007-December/019470.html
> After some thought I think the best place to put such a change would
> be at the filesystem level. For example, if one had a FUSE filesystem
> that simply ran on top of an existing one, that wrote its files as I
> described (or uses diff-like methods), but presents a clean filesystem
> for rsync (or indeed any tool) to make use of. I believe I may look in
> that direction instead of hacking rsync.
You could do that, but note that the rsync receiver won't explicitly
tell the filesystem what files are similar, so you'll have to either
keep a big hashtable to help you coalesce identical blocks globally or
use some kludge like looking at what other files the receiver has open
while it is writing the destination file.
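
To give a rough idea of the hashtable approach, here is a minimal in-memory Python sketch. The `BlockStore` class, its method names, and the 4096-byte block size are all hypothetical; a real FUSE filesystem would keep the blocks and the table on disk and would have to handle partial writes, which this ignores.

```python
import hashlib


class BlockStore:
    """Toy store that keeps each distinct block once, keyed by its hash."""

    def __init__(self, block_size: int = 4096):
        self.block_size = block_size
        self.blocks = {}   # SHA-256 hex digest -> block bytes (the global table)
        self.files = {}    # file name -> ordered list of block digests

    def write_file(self, name: str, data: bytes) -> None:
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # setdefault keeps the existing copy if this block was seen before,
            # whichever file it originally came from.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read_file(self, name: str) -> bytes:
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])
```

Because blocks are matched by content rather than by file name, two snapshots of the same large file share every block that did not change, without the filesystem needing to know the files are related.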

Matt
