T.J. Crowder
2012-Aug-10 16:03 UTC
Question about --partial-dir and aborted transfers of large files
Apologies to the list, the title of this thread is completely wrong. It
should be something like "Question about --partial-dir and aborted
transfers of large files". Let's see if this mailing list program will
allow me to change it...

-- T.J.

On 10 August 2012 15:28, T.J. Crowder <tj at crowdersoftware.com> wrote:
> Hi all,
>
> rsync is a fantastic tool. :-) I'm blown away with what I've seen so far.
>
> I have a question about --partial-dir transfers. I've read through this
> thread:
> http://lists.samba.org/archive/rsync/2011-July/026575.html
> ...but while similar, I don't think it's quite the same, and I didn't find
> my answer there.
>
> The short(ish) version:
>
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
>
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)
>
> If so, it would appear that this means a large amount of unnecessary data
> may end up being transferred in the second sync of a large file if you
> interrupt the first sync. Is there an option or some such to address this?
> If not, would it be feasible to add? (Details on how I see that working
> below, and I may be able to pitch in.)
>
> The long version:
>
> Sometimes I need to sync very large files (VM disk images) using ssh,
> during an eight-hour time window. With my connection to the target server,
> eight hours is unlikely to be enough, so I'll have to interrupt the sync
> and continue it in the next day's window. Sometimes, the VM disk image will
> be changed again in the meantime, but this isn't necessary to trigger the
> behavior I mentioned above. (It is a case I'll have to handle.)
>
> I've run a few experiments with rsync in this area, and it looks like it
> causes a fair bit of unnecessary data transfer.
>
> Here's how I caused that:
>
> 1. I created a file with 100,000 lines of text with exactly the same
> length, and put it in both the source and destination.
>
> 2. In the source copy, I modified the first 20K lines. So roughly 20% of
> the file has been changed. I didn't change the *length* of the lines (in
> any of these experiments), because I'm trying to emulate a VM disk file
> which is conveniently organized into fixed-size blocks.
>
> 3. I started a sync:
>
>     rsync -avr --partial-dir=.rstmp src username@server:/dest/
>
> ...and cancelled it part-way through. This leaves a partial file in my
> .rstmp directory as expected. (In my case, just the first few hundred
> lines.)
>
> 4. I restarted the sync, allowing it to complete.
>
> The second sync ended up transferring nearly the entire file, basically
> the whole 100K lines minus the few hundred from the first sync. The 80K of
> unchanged lines were transferred, whereas if I hadn't interrupted the first
> sync, they wouldn't have been.
>
> I followed up with this experiment:
>
> 1. Starting with a synced file, I changed 20K lines in the *middle* of the
> file rather than at the beginning.
>
> 2. I started a sync and cancelled it part-way through, after about the
> same amount of time as the previous experiment. This leaves a partial file
> in my .rstmp directory as expected -- but it's a LOT bigger, rsync has
> quite intelligently copied the unchanged beginning of the file locally on
> the target machine, up until the first change, and then transferred the
> changed data after that -- which is when I interrupted it.
>
> 3. I started the sync again and let it continue, and it sent all of the
> rest of the file, the vast majority of which was already present in the
> original target file.
>
> In subsequent experiments, I was able to determine that if I changed part
> of the file that had already been transferred into the partial file (say,
> changing line 1 between steps 2 and 3 above), rsync was very smart about
> that, just transferring the changed bit without re-transferring everything
> in-between. That's why it seems to me it uses the full delta-transfer
> algorithm on the partial -- or at least some version of it.
>
> All of this seems to suggest that the partial file is created by copying
> the target file up to the first change and then applying changes -- but
> that if you interrupt it, because the partial file is shorter than the
> source file, all of the remaining source file is transferred.
>
> Armed with that information, I tried to box clever: I thought "If I know
> I'm going to be doing one of these big files, maybe I could just copy the
> target to the .rstmp on the target machine in advance, so the
> delta-transfer applies to it." Unfortunately, though, cancelling the
> transfer early truncates the partial file. Drat. It wouldn't have been
> particularly elegant, but still would have been a workaround for now.
>
> If I'm right about all of the above (which I wouldn't put money on), it
> seems like it would be possible to address this in a logically simple way.
> Logically simple doesn't equate to being simple in code, of course. :-) The
> idea being, basically, that when referring to blocks in the target partial
> file (whether for determining the checksum of the block or transferring the
> data), if the target partial file is missing the block entirely, use the
> equivalent block from the actual target file -- so for checksum purposes,
> that tells us whether it changed, and for data transfer purposes if it
> didn't change, we know we can copy it locally on the target server.
>
> If there isn't already an option to address this, would it be feasible to
> do? I may be able to pitch in if so.
>
> Thanks in advance,
> --
> T.J. Crowder
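
The fallback idea at the end of the message above can be illustrated with a
small model. This is only a sketch under loud assumptions: it is Python
rather than rsync's C, it compares blocks at aligned offsets instead of using
rsync's rolling checksums, and the names blocks() and blocks_to_send() are
hypothetical, not anything in rsync. It reproduces the shape of the first
experiment (100 fixed-size blocks, the first 20 changed, an interrupted
transfer that left 5 blocks in the partial file) and counts how many blocks
would have to be sent with and without falling back to the target file.

```python
def blocks(data: bytes, size: int) -> list:
    """Split data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def blocks_to_send(source: bytes, partial: bytes, target: bytes,
                   size: int, use_target_fallback: bool) -> list:
    """Return the indexes of source blocks that must go over the wire.

    Simplified model (assumption): block i of the source is only compared
    with block i of the basis, unlike rsync's offset-independent matching.
    """
    src = blocks(source, size)
    par = blocks(partial, size)
    tgt = blocks(target, size)
    to_send = []
    for i, block in enumerate(src):
        if i < len(par):
            basis = par[i]          # block exists in the partial file
        elif use_target_fallback and i < len(tgt):
            basis = tgt[i]          # proposed: fall back to the real target
        else:
            basis = None            # nothing to compare against
        if basis != block:
            to_send.append(i)
    return to_send


BLOCK = 10  # bytes per block in this toy example
target = b"".join(f"old{i:07d}".encode() for i in range(100))
source = b"".join(
    (f"new{i:07d}" if i < 20 else f"old{i:07d}").encode() for i in range(100)
)
partial = source[:5 * BLOCK]  # the interrupted sync delivered 5 blocks

without_fallback = blocks_to_send(source, partial, target, BLOCK, False)
with_fallback = blocks_to_send(source, partial, target, BLOCK, True)
print(len(without_fallback), len(with_fallback))
```

In this model the resumed transfer sends every block beyond the end of the
partial file when only the partial file is consulted, and only the changed
blocks that have not arrived yet when the target file is used as a fallback.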
Wayne Davison
2012-Aug-12 17:41 UTC
Question about --partial-dir and aborted transfers of large files
On Fri, Aug 10, 2012 at 9:03 AM, T.J. Crowder <tj at crowdersoftware.com> wrote:
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
>
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)

Yes. The current code behaves the same as if you had specified --partial
(as far as the next transfer goes), just without actually being destructive
of the destination file.

I have imagined making the code pretend that the partial file and any
destination file are concatenated together for the purpose of generating
checksums. That would allow content references to both files, but rsync
would need to be enhanced to open both files in both the generator and the
receiver and be able to figure out what read goes where (which shouldn't be
too hard). I'd suggest that the code read the partial file first, padding
out the end of its data to an even checksum-sized unit so that the
destination file starts on an even checksum boundary (so that the code
never needs to combine data from two files in a single checksum or copy
reference).

> If so, it would appear that this means a large amount of unnecessary data
> may end up being transferred in the second sync of a large file if you
> interrupt the first sync.

It all depends on where you interrupt it and how much data matches in the
remaining portion of the destination file. It does give you the option of
discarding the partial data if it is too short to be useful, or possibly
doing your own concatenation of the whole (or trailing portion) of the
destination file onto the partial file, should you want to tweak things
before resuming the transfer.

..wayne..
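
The "figure out what read goes where" step in the reply above can be sketched
as an offset mapping. This is an illustration only, written in Python and not
taken from rsync's source: the VirtualBasis class, its field names, and the
"partial"/"padding"/"dest" labels are all hypothetical, and block_size stands
in for whatever checksum block length the transfer would actually use. The
partial file's length is rounded up to a whole number of blocks so that the
destination file begins on a block boundary, and every offset in the
pretended concatenation resolves to exactly one backing region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualBasis:
    """Pretend concatenation: partial file, padding, then destination file."""
    partial_len: int   # bytes present in the partial file
    dest_len: int      # bytes in the existing destination file
    block_size: int    # checksum block length (assumed fixed here)

    @property
    def padded_partial_len(self) -> int:
        # Round the partial length up to the next multiple of block_size.
        return -(-self.partial_len // self.block_size) * self.block_size

    @property
    def total_len(self) -> int:
        return self.padded_partial_len + self.dest_len

    def resolve(self, offset: int) -> tuple:
        """Map a virtual offset to (region, offset within that region)."""
        if not 0 <= offset < self.total_len:
            raise ValueError("offset outside the virtual basis")
        if offset < self.partial_len:
            return ("partial", offset)
        if offset < self.padded_partial_len:
            return ("padding", offset - self.partial_len)
        return ("dest", offset - self.padded_partial_len)


basis = VirtualBasis(partial_len=2500, dest_len=10000, block_size=1000)
print(basis.padded_partial_len, basis.total_len)
print(basis.resolve(2499), basis.resolve(2500), basis.resolve(3000))
```

Because the destination region always starts at a multiple of block_size, a
whole block beginning at a block-aligned virtual offset never mixes
destination data with partial-file data, which is the property the padding is
meant to provide.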