Hi. When I discovered rsync, it immediately became one of my most indispensable utilities. It's a real godsend on bandwidth-limited links, especially digital cellular.

It works remarkably well in the general case, but I think the algorithm could be improved for one very important special case. Many (or even most) of the updated files I transfer with rsync change only by having data appended to the end. Examples of such files include system logs and (especially) email archives in mbox format.

Rsync correctly handles these files, of course, but I think it could do so more efficiently. Right now, the receiver sends back a list of checksums for the blocks it has, and this checksum list can grow quite long when the file is large. I often see transfers of large mailboxes where appending one small email message to the sender's copy results in a reverse transfer of block checksums that is much larger than the new message.

It seems to me that this situation is common enough that the rsync protocol should look for it as a special case. Once the protocol has determined from differing timestamps and/or lengths that a file needs to be synchronized, the receiver should return a hash (and length) of its copy of the entire file to the sender. The sender then computes the hash for the corresponding leading segment of its copy. If they match, the sender simply sends the newly appended data and instructs the receiver to append it to its copy.

I just joined this list, and I couldn't find any obvious discussion of this issue in the archives. My apologies if it has already been discussed.

Phil Karn
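The proposed exchange can be sketched in a few lines of Python. This is only an illustration of the idea in the message above, not anything rsync implements: the function names are hypothetical, SHA-256 is an arbitrary choice of hash (the proposal does not name one), and the files are held in memory as bytes to keep the sketch short.

```python
import hashlib
from typing import Optional, Tuple


def receiver_summary(old: bytes) -> Tuple[int, str]:
    """Receiver side: report the length and whole-file hash of its copy.

    Hypothetical helper; SHA-256 is an assumed hash choice.
    """
    return len(old), hashlib.sha256(old).hexdigest()


def appended_tail(new: bytes, old_len: int, old_digest: str) -> Optional[bytes]:
    """Sender side: return the appended bytes if the receiver's copy is an
    exact prefix of the sender's file, otherwise None.

    None means "fall back to the normal per-block checksum exchange".
    """
    if old_len > len(new):
        # The sender's file is shorter, so it cannot be a pure append.
        return None
    if hashlib.sha256(new[:old_len]).hexdigest() != old_digest:
        # The leading segment differs, so the file was modified in place.
        return None
    return new[old_len:]


if __name__ == "__main__":
    old = b"From alice\n\nfirst message\n"
    new = old + b"From bob\n\nsecond message\n"
    length, digest = receiver_summary(old)
    print(appended_tail(new, length, digest))
```

When the prefix check fails, the sketch returns None rather than guessing, which corresponds to continuing with the existing protocol.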
On 12 Dec 2001, rsync@ka9q.net wrote:

> It seems to me that this situation is common enough that the rsync
> protocol should look for it as a special case. Once the protocol has
> determined from differing timestamps and/or lengths that a file needs
> to be synchronized, the receiver should return a hash (and length) of
> its copy of the entire file to the sender.

That's a good point and an interesting idea. Naively implemented it would add another round trip, which would probably not be worthwhile, but there might be a better solution.

You might like to read the technical report if you have not already.

--
Martin
rsync@ka9q.net [rsync@ka9q.net] writes:

> It seems to me that this situation is common enough that the rsync
> protocol should look for it as a special case. Once the protocol has
> determined from differing timestamps and/or lengths that a file needs
> to be synchronized, the receiver should return a hash (and length) of
> its copy of the entire file to the sender. The sender then computes
> the hash for the corresponding leading segment of its copy. If they
> match, the sender simply sends the newly appended data and instructs
> the receiver to append it to its copy.

While potentially a useful option, you wouldn't want the protocol to automatically always check for it, since it would preclude rsync on the sending side from being able to use part of the original file when transmitting the newly added data to the receiver. While perhaps not helpful for log files, that can be a big win for other files, even if the current copy on the receiver matches the sender's initial portion.

So at best, you'd only want to enable this option when every file in a given run is known to grow this way.

Alternatively, even with rsync the way it is today, what I do is manually bump up the blocksize to something large (say 16 or 32K). This results in far fewer blocks for the checksum algorithm (perhaps 10-45x fewer, depending on the original file size, given the default dynamic blocksize selection) and thus minimizes the metadata transmitted for the common portion of the file. It works pretty well for me with database transaction log files, which get pretty big.

You can probably find some past e-mail on the subject in the list by looking for threads about rsync blocksize.

-- David

/-----------------------------------------------------------------------\
 \               David Bolen            \   E-mail: db3l@fitlinxx.com  /
  |             FitLinxx, Inc.            \  Phone: (203) 708-5192    |
 /  860 Canal Street, Stamford, CT  06902   \  Fax: (203) 316-5150     \
\-----------------------------------------------------------------------/
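To put rough numbers on the block-size suggestion above, here is a small back-of-the-envelope calculation in Python. The helper name is hypothetical, and the 20 bytes of checksum data per block (a 4-byte rolling checksum plus a 16-byte strong checksum) and the 700-byte comparison block size are assumptions made for illustration; the actual per-block overhead and default block size depend on the rsync version and options.

```python
def checksum_list_bytes(file_size: int, block_size: int,
                        bytes_per_block: int = 20) -> int:
    """Estimate the size of the receiver's checksum list.

    bytes_per_block is an assumed figure (4-byte rolling checksum plus
    16-byte strong checksum), not a value taken from the thread.
    """
    blocks = -(-file_size // block_size)  # ceiling division
    return blocks * bytes_per_block


if __name__ == "__main__":
    size = 100 * 1024 * 1024  # a 100 MiB mailbox or log file
    for block_size in (700, 16 * 1024, 32 * 1024):
        total = checksum_list_bytes(size, block_size)
        print(f"block size {block_size:>6}: about {total} bytes of checksums")
```

Under these assumptions the checksum list shrinks in direct proportion to the block size, which is why a larger block size helps when the common portion of the file is large and unchanged.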
rsync@ka9q.net [rsync@ka9q.net] writes:

> >While potentially a useful option, you wouldn't want the protocol to
> >automatically always check for it, since it would preclude rsync on
>
> This extension need not break any existing mechanism; if the hash of
> the receiver's copy of the file doesn't match the start of the
> sender's file, the protocol would continue as before.

Well, my point was that even if it does match, you might still want the protocol to continue as before. For example, you might have a file that grows but tends to contain similar information. In that case, you still want the per-block checksum information from the destination, because the source can use that information to minimize the amount of new data to transmit. Without the per-block information, it can't tell how to extract data from the current copy at the destination to re-use for the new data rather than sending the new data directly.

Not a big deal for appending log files (as long as they have changing date strings), but not necessarily something to have enabled by default.

> >Alternatively, even with rsync the way it is today, what I do is
> >manually bump up the blocksize to something large (say 16 or 32K).
>
> This sounds like an excellent idea, and I'll give it a try. As the
> blocksize reaches the receiver's file size, the scheme essentially
> approaches my idea.

Hmm, I've never tried _really_ large block sizes (I thought I had problems if I got close to 64K, but I may be mis-remembering). The one drawback to the larger block sizes is that if you do encounter any differences, you'll retransmit more information than necessary, but if you know beforehand that it's definitely just appended data, that won't be the case.

-- David
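The drawback mentioned above can be illustrated with a deliberately simplified model in Python. It compares the old and new files block by block at fixed offsets and counts the bytes in every block that differs; this is an assumption made for brevity and is not rsync's rolling-checksum matching, which can also find blocks that have moved. The function name is hypothetical.

```python
def literal_bytes(old: bytes, new: bytes, block_size: int) -> int:
    """Count bytes of `new` lying in blocks that differ from `old`.

    Simplified fixed-offset comparison (an assumption for illustration),
    standing in for the data that would have to be sent literally.
    """
    total = 0
    for start in range(0, len(new), block_size):
        chunk = new[start:start + block_size]
        if old[start:start + block_size] != chunk:
            total += len(chunk)
    return total


if __name__ == "__main__":
    old = bytes(64 * 1024)   # 64 KiB of zero bytes
    changed = bytearray(old)
    changed[1000] = 1        # a single byte modified in place
    for block_size in (700, 32 * 1024):
        sent = literal_bytes(old, bytes(changed), block_size)
        print(f"block size {block_size:>6}: {sent} bytes resent")
```

In this model a one-byte change costs a full block, so the penalty grows with the block size, while a file that only has data appended at the end is not affected in the same way.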