thr3ads.net - rsync - Question about --partial-dir and aborted transfers of large files [Aug 2012]

T.J. Crowder

2012-Aug-10 16:03 UTC

Question about --partial-dir and aborted transfers of large files

Apologies to the list, the title of this thread is completely wrong. It
should be something like "Question about --partial-dir and aborted
transfers of large files". Let's see if this mailing list program will
allow me to change it...

-- T.J.


On 10 August 2012 15:28, T.J. Crowder <tj at crowdersoftware.com> wrote:
> Hi all,
>
> rsync is a fantastic tool. :-) I'm blown away with what I've seen so far.
>
> I have a question about --partial-dir transfers. I've read through this
> thread:
> http://lists.samba.org/archive/rsync/2011-July/026575.html
> ...but while similar, I don't think it's quite the same, and I didn't find
> my answer there.
>
> The short(ish) version:
>
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
>
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)
>
> If so, it would appear that this means a large amount of unnecessary data
> may end up being transferred in the second sync of a large file if you
> interrupt the first sync. Is there an option or some such to address this?
> If not, would it be feasible to add? (Details on how I see that working
> below, and I may be able to pitch in.)
>
> The long version:
>
> Sometimes I need to sync very large files (VM disk images) using ssh,
> during an eight-hour time window. With my connection to the target server,
> eight hours is unlikely to be enough, so I'll have to interrupt the sync
> and continue it in the next day's window. Sometimes, the VM disk image will
> be changed again in the meantime, but this isn't necessary to trigger the
> behavior I mentioned above. (It is a case I'll have to handle.)
>
> I've run a few experiments with rsync in this area, and it looks like it
> causes a fair bit of unnecessary data transfer.
>
> Here's how I caused that:
>
> 1. I created a file with 100,000 lines of text with exactly the same
> length, and put it in both the source and destination.
>
> 2. In the source copy, I modified the first 20K lines. So roughly 20% of
> the file has been changed. I didn't change the *length* of the lines (in
> any of these experiments), because I'm trying to emulate a VM disk file
> which is conveniently organized into fixed-size blocks.
>
> 3. I started a sync:
>
> rsync -avr --partial-dir=.rstmp src username@server:/dest/
>
> ...and cancelled it part-way through. This leaves a partial file in my
> .rstmp directory as expected. (In my case, just the first few hundred
> lines.)
>
> 4. I restarted the sync, allowing it to complete.
>
> The second sync ended up transferring nearly the entire file, basically
> the whole 100K lines minus the few hundred from the first sync. The 80K of
> unchanged lines were transferred, whereas if I hadn't interrupted the first
> sync, they wouldn't have been.
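
The setup in steps 1 and 2 can be reproduced with something like the sketch below. The 64-byte line width, the OLD/NEW tags, and the file names are assumptions made for illustration; the post does not give them.

```python
import os
import tempfile

LINE_WIDTH = 64          # assumed bytes per line, newline included
TOTAL_LINES = 100_000    # from the post
CHANGED_LINES = 20_000   # from the post


def make_line(index, tag):
    """Build one fixed-width line: tag, zero-padded index, filler, newline."""
    body = f"{tag}{index:08d}".encode("ascii")
    return body.ljust(LINE_WIDTH - 1, b".") + b"\n"


def write_file(path, changed):
    """Write TOTAL_LINES lines; lines whose index is in `changed` get a new tag."""
    with open(path, "wb") as fh:
        for i in range(TOTAL_LINES):
            fh.write(make_line(i, "NEW" if i in changed else "OLD"))


workdir = tempfile.mkdtemp()
dest_path = os.path.join(workdir, "dest.txt")   # hypothetical file names
src_path = os.path.join(workdir, "src.txt")

write_file(dest_path, range(0))                 # unmodified copy
write_file(src_path, range(CHANGED_LINES))      # first 20K lines changed in place
```

Both tags have the same length, so every line keeps its byte offset, which is the property the fixed-size-block emulation relies on. Passing a range such as `range(40_000, 60_000)` instead would be one way to build the "middle of the file" variant described next.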
>
> I followed up with this experiment:
>
> 1. Starting with a synced file, I changed 20K lines in the *middle* of the
> file rather than at the beginning.
>
> 2. I started a sync and cancelled it part-way through, after about the
> same amount of time as the previous experiment. This leaves a partial file
> in my .rstmp directory as expected -- but it's a LOT bigger, rsync has
> quite intelligently copied the unchanged beginning of the file locally on
> the target machine, up until the first change, and then transferred the
> changed data after that -- which is when I interrupted it.
>
> 3. I started the sync again and let it continue, and it sent all of the
> rest of the file, the vast majority of which was already present in the
> original target file.
>
> In subsequent experiments, I was able to determine that if I changed part
> of the file that had already been transferred into the partial file (say,
> changing line 1 between steps 2 and 3 above), rsync was very smart about
> that, just transferring the changed bit without re-transferring everything
> in-between. That's why it seems to me it uses the full delta-transfer
> algorithm on the partial -- or at least some version of it.
>
> All of this seems to suggest that the partial file is created by copying
> the target file up to the first change and then applying changes -- but
> that if you interrupt it, because the partial file is shorter than the
> source file, all of the remaining source file is transferred.
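
A toy model of that inference, as a hedged sketch rather than rsync's actual code: it only compares aligned fixed-size blocks, whereas rsync's real matching uses a rolling checksum and can match at any offset. The block size and byte strings are made up for the demonstration.

```python
import hashlib

BLOCK = 4  # toy block size; an assumption for the demo


def split_blocks(data):
    """Cut data into consecutive BLOCK-sized pieces (the last may be short)."""
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def literal_bytes(source, basis):
    """Count source bytes whose block has no identical block in the basis."""
    known = {hashlib.sha256(b).digest() for b in split_blocks(basis)}
    return sum(len(b) for b in split_blocks(source)
               if hashlib.sha256(b).digest() not in known)


dest = b"AAAABBBBCCCCDDDDEEEE"     # old copy on the target
source = b"aaaaBBBBCCCCDDDDEEEE"   # first block changed on the sender
partial = source[:8]               # what an interrupted run left behind

uninterrupted = literal_bytes(source, dest)   # basis is the full target file
resumed = literal_bytes(source, partial)      # basis is only the partial file
print(uninterrupted, resumed)                 # prints "4 12"
```

With the full target as the basis only the changed block is sent. With the short partial as the only basis, every block beyond its end has nothing to match against, so it all goes over the wire again.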
>
> Armed with that information, I tried to box clever: I thought "If I know
> I'm going to be doing one of these big files, maybe I could just copy the
> target to the .rstmp on the target machine in advance, so the
> delta-transfer applies to it." Unfortunately, though, cancelling the
> transfer early truncates the partial file. Drat. It wouldn't have been
> particularly elegant, but still would have been a workaround for now.
>
> If I'm right about all of the above (which I wouldn't put money on), it
> seems like it would be possible to address this in a logically simple way.
> Logically simple doesn't equate to being simple in code, of course. :-) The
> idea being, basically, that when referring to blocks in the target partial
> file (whether for determining the checksum of the block or transferring the
> data), if the target partial file is missing the block entirely, use the
> equivalent block from the actual target file -- so for checksum purposes,
> that tells us whether it changed, and for data transfer purposes if it
> didn't change, we know we can copy it locally on the target server.
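
The fallback proposed here could look roughly like this. It is a sketch of the idea only, not anything rsync implements; `basis_block` and its parameters are hypothetical names, and it assumes the partial and target files share the same block alignment.

```python
def basis_block(partial, target, index, block_size):
    """Return block `index` of the basis.

    Use the partial file when it holds any part of that block; if the
    partial is missing the block entirely, fall back to the equivalent
    block of the real target file.
    """
    start = index * block_size
    end = start + block_size
    if start >= len(partial):          # block lies wholly past the partial's end
        return target[start:end]
    return partial[start:end]


partial = b"aaaaBBBB"                  # interrupted transfer, two blocks long
target = b"AAAABBBBCCCCDDDD"           # old file still on the target
combined = [basis_block(partial, target, i, 4) for i in range(4)]
print(combined)                        # [b'aaaa', b'BBBB', b'CCCC', b'DDDD']
```

Blocks 2 and 3 come from the old target file, so unchanged data there could be checksummed and copied locally instead of being re-sent.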
>
> If there isn't already an option to address this, would it be feasible to
> do? I may be able to pitch in if so.
>
> Thanks in advance,
> --
> T.J. Crowder

Wayne Davison

2012-Aug-12 17:41 UTC

On Fri, Aug 10, 2012 at 9:03 AM, T.J. Crowder <tj at crowdersoftware.com> wrote:
> 1. Am I correct in inferring that when rsync sees data for a file in the
> --partial-dir directory, it applies its delta transfer algorithm to the
> partial file?
> 2. And that this is _instead of_ applying it to the real target file? (Not
> a nifty three-way combination.)
>
Yes.  The current code behaves the same as if you had specified --partial
(as far as the next transfer goes), just without actually being destructive
of the destination file.

I have imagined making the code pretend that the partial file and any
destination file are concatenated together for the purpose of generating
checksums.  That would allow content references to both files, but rsync
would need to be enhanced to open both files in both the generator and the
receiver and be able to figure out what read goes where (which shouldn't be
too hard).  I'd suggest that the code read the partial file first, padding
out the end of its data to an even checksum-sized unit so that the
destination file starts on an even checksum boundary (so that the code never
needs to combine data from two files in a single checksum or copy
reference).
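
A rough sketch of the offset bookkeeping such a concatenated view would need. This is only an illustration of the idea described above, not existing rsync code; the function names and the tuple return value are hypothetical.

```python
def padded_length(length, block_size):
    """Round length up to the next multiple of block_size."""
    return -(-length // block_size) * block_size


def locate(virtual_offset, partial_len, dest_len, block_size):
    """Map an offset in the virtual partial+destination stream to a real file.

    Layout assumed here: partial data, then padding up to a block boundary,
    then the destination file starting exactly on that boundary.
    """
    boundary = padded_length(partial_len, block_size)
    if virtual_offset < partial_len:
        return ("partial", virtual_offset)
    if virtual_offset < boundary:
        return ("padding", virtual_offset - partial_len)
    dest_offset = virtual_offset - boundary
    if dest_offset < dest_len:
        return ("dest", dest_offset)
    raise ValueError("offset is past the end of the combined data")


# A 10-byte partial with 4-byte blocks is padded to 12 bytes.
print(locate(3, 10, 20, 4))    # ('partial', 3)
print(locate(11, 10, 20, 4))   # ('padding', 1)
print(locate(12, 10, 20, 4))   # ('dest', 0)
```

Because the destination always begins on a block boundary, no block-sized read ever straddles the two files, which is the point of the padding.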

> If so, it would appear that this means a large amount of unnecessary data
> may end up being transferred in the second sync of a large file if you
> interrupt the first sync.
>
It all depends on where you interrupt it and how much data matches in the
remaining portion of the destination file.  It does give you the option of
discarding the partial data if it is too short to be useful, or possibly
doing your own concatenation of the whole (or trailing portion) of the
destination file onto the partial file, should you want to tweak things
before resuming the transfer.
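
One way to do that manual concatenation of the trailing portion, as a sketch to run on the target machine before resuming. The helper name and file names are hypothetical, and it assumes the file is edited in place so that byte offsets in the partial and destination files line up (as with the fixed-size blocks discussed in the thread).

```python
import os
import tempfile


def append_dest_tail(partial_path, dest_path):
    """Append the destination bytes that lie beyond the partial's current length.

    Returns the number of bytes appended.
    """
    start = os.path.getsize(partial_path)
    with open(dest_path, "rb") as dest, open(partial_path, "ab") as partial:
        dest.seek(start)
        while True:
            chunk = dest.read(1 << 20)   # copy in 1 MiB pieces
            if not chunk:
                break
            partial.write(chunk)
    return os.path.getsize(partial_path) - start


# Small demonstration with throwaway files.
workdir = tempfile.mkdtemp()
partial_path = os.path.join(workdir, "partial.img")
dest_path = os.path.join(workdir, "dest.img")
with open(partial_path, "wb") as fh:
    fh.write(b"aaaaBB")                  # interrupted transfer
with open(dest_path, "wb") as fh:
    fh.write(b"AAAABBBBCCCC")            # old destination file

appended = append_dest_tail(partial_path, dest_path)
with open(partial_path, "rb") as fh:
    result = fh.read()
print(appended, result)                  # 6 b'aaaaBBBBCCCC'
```

How much of the appended data then matches on the next run still depends on the content, as noted above.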

..wayne..
