thr3ads.net - rsync - efficient file appends [Dec 2001]

rsync@ka9q.net

2001-Dec-12 19:35 UTC

efficient file appends

Hi. When I discovered rsync, it immediately became one of my most
indispensable utilities. It's a real godsend on bandwidth-limited
links, especially digital cellular.

It works remarkably well in the general case, but I think the
algorithm could be improved for one very important special case.

Many (or even most) of the updated files I transfer with rsync change
only by having data appended to the end. Examples of such files
include system logs and (especially) email archives in mbox format.

Rsync correctly handles these files, of course, but I think it could
do so more efficiently. Right now, the receiver sends back a list of
checksums for the blocks it has, and this checksum list can grow quite
long when the file is large. I often see transfers of large mailboxes
where appending one small email message to the sender's copy
results in a reverse transfer of block checksums that is much larger
than the new message.

It seems to me that this situation is common enough that the rsync
protocol should look for it as a special case. Once the protocol has
determined from differing timestamps and/or lengths that a file needs
to be synchronized, the receiver should return a hash (and length) of
its copy of the entire file to the sender.  The sender then computes
the hash for the corresponding leading segment of its copy. If they
match, the sender simply sends the newly appended data and instructs
the receiver to append it to its copy.
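
As a rough illustration of the exchange proposed above, here is a minimal sketch in Python. It is not rsync code: the function names `receiver_summary` and `sender_appended_data` are hypothetical, the files are modeled as in-memory byte strings, and SHA-256 is assumed as the whole-file hash purely for convenience.

```python
import hashlib
from typing import Optional, Tuple


def receiver_summary(receiver_copy: bytes) -> Tuple[int, bytes]:
    """Receiver side: return (length, hash) of its entire copy.

    SHA-256 is an assumption for this sketch, not what rsync uses.
    """
    return len(receiver_copy), hashlib.sha256(receiver_copy).digest()


def sender_appended_data(sender_copy: bytes, length: int,
                         digest: bytes) -> Optional[bytes]:
    """Sender side: hash the leading `length` bytes of its own copy.

    Returns the newly appended bytes if the prefix matches the
    receiver's hash, or None to signal that the normal block-checksum
    exchange should be used instead.
    """
    if length > len(sender_copy):
        return None  # receiver's copy is longer, so this is not an append
    if hashlib.sha256(sender_copy[:length]).digest() != digest:
        return None  # leading segment differs, so fall back
    return sender_copy[length:]
```

With this shape, the receiver sends only a length and one digest however large the file is, and the sender replies with either the tail to append or a fallback signal.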

I just joined this list, and I couldn't find any obvious discussion of
this issue in the archives. My apologies if it has already been
discussed.

Phil Karn

Martin Pool

2001-Dec-13 10:18 UTC

On 12 Dec 2001, rsync@ka9q.net wrote:
> It seems to me that this situation is common enough that the rsync
> protocol should look for it as a special case. Once the protocol has
> determined from differing timestamps and/or lengths that a file needs
> to be synchronized, the receiver should return a hash (and length) of
> its copy of the entire file to the sender.
That's a good point and an interesting idea.  Naively implemented it
would add another round trip, which would probably not be worthwhile,
but there might be a better solution.  You might like to read the
technical report if you have not already.

-- 
Martin

David Bolen

2001-Dec-13 10:28 UTC

rsync@ka9q.net [rsync@ka9q.net] writes:
> It seems to me that this situation is common enough that the rsync
> protocol should look for it as a special case. Once the protocol has
> determined from differing timestamps and/or lengths that a file needs
> to be synchronized, the receiver should return a hash (and length) of
> its copy of the entire file to the sender.  The sender then computes
> the hash for the corresponding leading segment of its copy. If they
> match, the sender simply sends the newly appended data and instructs
> the receiver to append it to its copy.
While potentially a useful option, you wouldn't want the protocol to
automatically always check for it, since it would preclude rsync on
the sending side from being able to use part of the original file when
transmitting the newly added data to the receiver.  While perhaps not
helpful for log files, it can be a big win for other files, even if
the current copy on the receiver matches the sender's initial portion.
So at best, you'd only want to enable this option when every file in
a given run is known to grow this way.

Alternatively, even with rsync the way it is today, what I do is
manually bump up the blocksize to something large (say 16 or 32K).
This results in far fewer blocks for the checksum algorithm (perhaps
10-45x fewer than the default dynamic blocksize selection would
produce, depending on original file size) and thus minimizes the
metadata transmitted for the common portion of the file.  It works pretty well
for me with database transaction log files which get pretty big.  You
can probably find some past e-mail on the subject in the list by
looking for threads about rsync blocksize.
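
The effect of block size on the checksum list can be estimated with a little arithmetic. The sketch below is only an approximation under stated assumptions: the function name `checksum_list_bytes` is hypothetical, and the figure of 20 bytes of checksum data per block is an assumed value, not one taken from this thread or measured on the wire.

```python
import math


def checksum_list_bytes(file_size: int, block_size: int,
                        bytes_per_block: int = 20) -> int:
    """Estimate the size of the checksum list the receiver sends back.

    bytes_per_block is an assumed per-block cost (weak plus strong
    checksum); the real figure depends on the rsync version and options.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks * bytes_per_block


# Hypothetical example: a 70,000,000-byte mailbox.
small_blocks = checksum_list_bytes(70_000_000, 700)    # assumed small block size
large_blocks = checksum_list_bytes(70_000_000, 32768)  # 32K, as suggested above
```

Under these assumptions the list shrinks from 2,000,000 bytes to 42,740 bytes, about 47 times smaller, which is the kind of saving that matters when the appended data is only a few kilobytes.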

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l@fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/

David Bolen

2001-Dec-13 14:14 UTC

rsync@ka9q.net [rsync@ka9q.net] writes:
> >While potentially a useful option, you wouldn't want the protocol to
> >automatically always check for it, since it would preclude rsync on
>
> This extension need not break any existing mechanism; if the hash of
> the receiver's copy of the file doesn't match the start of the
> sender's file, the protocol would continue as before.
Well, my point was that even if it does match, you might still want
the protocol to continue as before.  For example, you might have a file
that grows but tends to contain similar information.  In that case,
you still want the per-block checksum information from the destination
because that way the source can use that information to minimize the
amount of new information to transmit.  Without having the per-block
information, it can't tell how to extract data from the current copy
at the destination to re-use for the new data rather than sending the
new data directly.  Not a big deal for appending log files (as long as
they have changing date strings), but not necessarily something to
have enabled by default.
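
To make the point above concrete, here is a simplified sketch of block matching in Python. It is not rsync's algorithm: it hashes a window at every offset with SHA-256 instead of using a rolling checksum, and the names `block_table`, `delta`, and `literal_bytes` are hypothetical. It only shows how per-block information from the destination lets the source avoid sending appended data that repeats earlier content.

```python
import hashlib
from typing import Dict, List, Tuple


def block_table(old: bytes, block_size: int) -> Dict[bytes, int]:
    """Destination side: map each full block's digest to its block index."""
    table: Dict[bytes, int] = {}
    for start in range(0, len(old) - block_size + 1, block_size):
        digest = hashlib.sha256(old[start:start + block_size]).digest()
        table.setdefault(digest, start // block_size)
    return table


def delta(new: bytes, table: Dict[bytes, int], block_size: int) -> List[Tuple]:
    """Source side: emit ("copy", block_index) or ("literal", data) ops."""
    ops: List[Tuple] = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        chunk = new[pos:pos + block_size]
        index = None
        if len(chunk) == block_size:
            index = table.get(hashlib.sha256(chunk).digest())
        if index is not None:
            if literal:  # flush pending literal data before the copy
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", index))
            pos += block_size
        else:
            literal.append(new[pos])
            pos += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def literal_bytes(ops: List[Tuple]) -> int:
    """Count the bytes that must actually be transmitted as new data."""
    return sum(len(data) for kind, data in ops if kind == "literal")
```

If the destination holds `b"A" * 8 + b"B" * 8` and the source has appended another `b"A" * 8`, then with a block size of 8 the delta consists only of copy operations and no literal data, whereas an append-only shortcut would send all 8 new bytes.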
> >Alternatively, even with rsync the way it is today, what I do is
> >manually bump up the blocksize to something large (say 16 or 32K).
> 
> This sounds like an excellent idea, and I'll give it a try. As the
> blocksize reaches the receiver's file size, the scheme essentially
> approaches my idea.
Hmm, I've never tried _really_ large block sizes (I thought I had
problems if I got close to 64K, but I may be mis-remembering).  The
one drawback to the larger block sizes is that if you do encounter any
differences, you'll retransmit more information than necessary, but if
you know beforehand that it's definitely just appended data, that won't
be the case.
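
The drawback mentioned above can be sketched the same way. This is a simplified model, assuming that a change anywhere inside a block forces that whole block to be resent and that changes do not shift block boundaries; the function name `resend_bytes` is hypothetical.

```python
from typing import Iterable


def resend_bytes(changed_offsets: Iterable[int], block_size: int,
                 file_size: int) -> int:
    """Estimate bytes retransmitted when the given byte offsets differ.

    Simplified model: each block containing at least one changed byte
    is resent in full; the final block may be shorter than block_size.
    """
    dirty_blocks = {offset // block_size for offset in changed_offsets}
    total = 0
    for block in dirty_blocks:
        start = block * block_size
        end = min(start + block_size, file_size)
        total += end - start
    return total
```

In this model, a single changed byte at offset 100,000 in a 1,000,000-byte file costs 700 bytes to resend with 700-byte blocks but 32,768 bytes with 32K blocks.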

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l@fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
